RESEARCH OVERVIEW


The  research  vehicle  for  this  contract  is  the  largest  possible  computer  that  could  be  conceived  for  the 
mid  to  late  1990s.  The  technical  challenges  of  such  a  machine  serve  as  the  guiding  stimulus  for  the  research 
carried  out  and  reported  here. 

We  imagine  this  machine  to  occupy  a  14-story  building,  to  cost  upwards  of  $1,000,000,000,  and  to  be  so 
colossal  that  the  nation  can  only  afford  one  or  two  of  them.  The  available  chip  technology  and  machine  size  are 
consistent  with  a  million  billion  FLOPS  (that’s  10  to  the  15th)  and  a  million  billion  Bytes  of  memory.  It  will 
dissipate  50  megawatts  of  power  using  CMOS  technology.  Communication  across  the  machine  will  be  much 
slower  than  computation  at  a  node.  The  architecture,  software,  interconnect  technology,  packaging,  and 
operating  system  are  unknown. 
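
As a rough check on these figures, the quoted power and throughput imply an energy budget per floating-point operation. The sketch below only divides the numbers stated above; the variable names are illustrative and not part of the study.

```python
# Back-of-envelope arithmetic using only the figures quoted in the text.
flops = 1e15          # floating-point operations per second
memory_bytes = 1e15   # bytes of memory
power_watts = 50e6    # 50 megawatts

joules_per_flop = power_watts / flops     # energy budget per operation
bytes_per_flops = memory_bytes / flops    # memory per unit of throughput

print(f"{joules_per_flop * 1e9:.0f} nJ per floating-point operation")
print(f"{bytes_per_flops:.0f} byte of memory per FLOPS")
```

Under these assumptions the whole machine, including memory and communication, has about 50 nJ to spend on each operation.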

This  investigation  deals  with  hardware  technology,  software  techniques,  programming  algorithms, 
communications,  processing  elements,  and  applications.  The  study  will  determine  the  plausibility  (not 
feasibility) of such a machine. Progress in these various areas is highlighted in the individual sections below.

CIRCUITS

A  returning  faculty  member,  Prof.  Thomas  F.  Knight,  Jr.,  has  taken  over  this  aspect  of  the  work 
following  the  departure  of  Prof.  Lance  A.  Glasser.  Some  immediate  plans  for  this  activity  are  outlined  below. 

We  will  design  and  construct  a  low  latency  processor  to  processor  communication  switch  which  uses 
innovative  ideas  at  the  architectural,  silicon,  and  packaging  levels  to  reduce  communications  delays.  At  the 
architectural  level,  the  impact  of  simple  source-responsible  routing  protocols  on  the  development  of  fault 
tolerant,  extremely  simple  routing  elements  will  be  investigated.  The  reduction  in  complexity  of  the  routing  part 
contributes  to  its  speed.  At  the  circuit  level,  high  speed  chip  to  chip  communication  techniques  such  as 
transistor  series  terminated  drivers  will  be  investigated,  as  well  as  some  ideas  for  using  high  speed  microwave 
modems to communicate in a non-baseband environment. In packaging, the objective includes a liquid-cooled,
dense,  three-dimensional  second  level  wiring  technology,  with  almost  isotropic  wire  density  in  all  three 
dimensions. 


PROCESSING  ELEMENTS 


Prof.  Dally  and  his  students  have  made  significant  progress  in  development  of  processing  elements  and 
associated  communications  circuits. 

We developed a deadlock-free adaptive routing algorithm for k-ary n-cube networks. This algorithm will
be  used  to  implement  fault-tolerant  networks. 

We  completed  an  analysis  and  simulation  study  of  network  performance  using  different  flow  control 
strategies.  This  study  showed  that  adaptive  routing  does  not  significantly  improve  performance.  A  flow  control 
discipline that permits messages to pass one another is needed to improve performance further.
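
For readers unfamiliar with the topology: a k-ary n-cube has k^n nodes, each addressed by n radix-k digits, and every dimension forms a ring of k nodes. The sketch below shows plain deterministic dimension-order routing on such a network, which is the usual baseline against which adaptive schemes are compared. It is not the adaptive algorithm reported above, and the function name and path representation are assumptions made for illustration.

```python
def dimension_order_route(src, dst, k):
    """Return the list of nodes visited from src to dst in a k-ary n-cube.

    Nodes are tuples of n radix-k digits. Each dimension is corrected in
    turn, taking the shorter way around that dimension's ring.
    """
    path = [tuple(src)]
    cur = list(src)
    for dim in range(len(cur)):
        forward = (dst[dim] - cur[dim]) % k       # hops if we count upward
        step = 1 if forward <= k - forward else -1
        while cur[dim] != dst[dim]:
            cur[dim] = (cur[dim] + step) % k
            path.append(tuple(cur))
    return path

route = dimension_order_route((0, 0), (3, 1), k=4)
print(route)
```

Because the path is fixed by the source and destination, a message cannot route around a faulty or congested link, which is the limitation an adaptive algorithm addresses.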

In  the  laboratory  we  demonstrated  a  number  of  our  network  and  arithmetic  concepts  in  three  prototype 
chips:  the  NDF  router,  the  RAP  arithmetic  chip,  and  a  high-bandwidth  memory  chip. 

A  computer  the  size  of  the  American  Resource  Computer  will  require  the  ability  to  change  state  rapidly 
to  hide  transmission  latency  without  sacrificing  single-thread  performance.  We  are  working  on  an  architecture 
for  a  named  state  processor  that  explicitly  binds  names  to  registers.  This  mechanism  combines  the  advantages 
of  multi-threading  and  multiple  register  sets  for  implementing  fast  context  switches  and  procedure  calls. 

We  are  investigating  programming  strategies  for  very  large  numbers  of  processors  based  on  agents  and 
agencies (Minsky, The Society of Mind). We are planning to use agencies to implement concurrency
abstractions for naming and information sharing.

We  are  investigating  the  application  of  a  computer  of  the  scale  of  the  American  Resource  Computer  to 
database applications. The issues involved include data partitioning, methods for ensuring stability and
persistence,  concurrency  control,  and  efficient  algorithms  for  search  and  update. 


COMMUNICATIONS  TOPOLOGY  AND  ROUTING  ALGORITHMS 


Charles  E.  Leiserson  is  currently  on  leave  at  Thinking  Machines  Corporation,  through  December  1988. 
On  his  return  to  MIT,  he  plans  to  continue  his  investigations  into  parallel  computation,  focusing  principally  on 
issues  related  to  timing,  synchronization,  fault  tolerance,  and  routing  algorithms.  He  also  expects  to  complete  a 
textbook,  entitled  Introduction  to  Algorithms,  coauthored  with  Thomas  H.  Cormen  and  Ronald  L.  Rivest. 

Guy  Blelloch  is  finishing  his  thesis,  titled  “Scan  Primitives  and  Parallel  Vector  Models.”  The  thesis 
suggests  a  new  class  of  algorithmic  models  for  parallel  computing.  These  models  are  based  on  a  set  of 
operations  on  vectors  of  atomic  values.  The  thesis  shows  how  the  models  can  be  used  for  algorithm  design,  how 
they  can  be  implemented  on  various  computers,  and  how  they  can  be  used  as  the  back  end  of  a  compiler  for 
high-level  languages.  The  thesis  also  suggests  that  a  set  of  scan  operations  should  be  considered  primitive 
parallel  operations.  Next  year  he  will  be  an  Assistant  Professor  at  Carnegie  Mellon  University. 
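
To make the idea of a scan primitive concrete, the sketch below shows an exclusive plus-scan and one typical use, packing the flagged elements of a vector. It is written sequentially for clarity and is only an illustration of the operation, not code from the thesis; the function names are assumptions.

```python
from itertools import accumulate

def exclusive_scan(values):
    """Exclusive plus-scan: out[i] is the sum of values[0..i-1]."""
    return [0] + list(accumulate(values))[:-1] if values else []

def pack(values, flags):
    """Keep the flagged elements, using a scan to compute destinations."""
    dest = exclusive_scan([1 if f else 0 for f in flags])
    out = [None] * sum(1 for f in flags if f)
    for v, f, d in zip(values, flags, dest):
        if f:
            out[d] = v      # every flagged element gets a distinct slot
    return out

print(exclusive_scan([2, 1, 2, 3, 5]))
print(pack(list("abcde"), [True, False, True, True, False]))
```

Once the scan has produced the destination indices, every element can be moved independently, which is what makes the operation attractive as a parallel building block.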

Tom  Cormen  has  concentrated  on  writing  the  textbook  Introduction  to  Algorithms  with  Professors 
Leiserson  and  Rivest.  The  textbook  includes  several  chapters  on  parallel  algorithms  and  circuitry.  The  writing 
should  be  completed  by  the  end  of  1988. 

Jeff Fried is currently working on a number of problems related to the impact of synchrony on the performance
of  distributed  algorithms.  He  is  also  working  on  the  architecture  and  blocking  analysis  of  sparse  circuit- 
switched  interconnection  networks.  His  most  significant  results  relate  to  the  design  of  VLSI  processors  for  use 
within  the  interconnection  networks  found  in  telecommunications,  distributed  computing,  and  parallel 
processing. 

Ron  Greenberg  and  Mike  Foster  of  Columbia  have  established  matching  lower  and  upper  bounds  for 
the  area  required  to  implement  finite-state  machines  in  VLSI.  In  addition  Greenberg  has  continued  work  on 
the  subject  of  universal  routing  networks  for  parallel  computation.  Greenberg  and  Leiserson’s  paper  on 
compact  layout  of  the  three-dimensional  tree  of  meshes  provides  a  simple  proof  of  results  used  to  establish 
upper  bounds  on  the  penalty  paid  for  using  a  general  network  to  simulate  all  parallel  machines,  and  Greenberg 
is  currently  making  progress  on  lower  bound  results. 

Greenberg  and  Alexander  Ishii  have  also  been  working  with  Alberto  Sangiovanni-Vincentelli  of  Berkeley 
on  a  multi-layer  channel  router  for  VLSI  circuits,  called  MulCh.  While  based  on  the  Chameleon  system 
developed  at  Berkeley,  MulCh  incorporates  the  additional  feature  that  nets  may  be  routed  entirely  on  a  single 
interconnect  layer  (Chameleon  requires  the  vertical  and  horizontal  sections  of  a  net  be  routed  on  different 
interconnect  layers).  When  used  on  sample  problems,  MulCh  shows  significant  improvements  over 
Chameleon  in  area,  total  wire  length,  and  via  count. 

Ishii has completed his master's thesis, which describes his model for VLSI timing analysis. The model
maps  continuous  data-domains,  such  as  voltage,  into  discrete,  or  digital,  data  domains,  while  retaining  a 
continuous  notion  of  time.  The  majority  of  the  thesis  concentrates  on  developing  lemmas  and  theorems  that 
can  serve  as  a  set  of  “axioms”  when  analyzing  algorithms  based  on  the  model.  Key  axioms  include  the  fact  that 
circuits  in  our  model  generate  only  well  defined  digital  signals,  and  the  fact  that  components  in  our  model 
support  and  accurately  handle  the  “undefined”  values  that  electrical  signals  must  take  on  when  they  make  a 
transition  between  valid  logic  levels.  In  order  to  facilitate  proofs  for  circuit  properties,  the  class  of 
computational  predicates  is  defined.  A  circuit  property  can  be  proved  by  simply  casting  the  property  as  a 
computational  predicate. 

Ishii  has  also  been  working  with  Bruce  Maggs  on  a  new  VLSI  design  for  a  high-speed  multi-port  register 
file.  Design  goals  include  short  cycle-time  and  single-cycle  register  window  context  changes.  This  research 
began as an advanced VLSI class project, under the supervision of Prof. Knight of the MIT Artificial
Intelligence  Laboratory. 

James K. Park is currently collaborating with Alok Aggarwal and Dina Kravets on a number of problems
in  Computational  Geometry.  He  is  also  working  with  Bruce  Maggs  on  the  problem  of  finding  an  optimal  offline 
deterministic  routing  algorithm  for  the  butterfly-fat-tree.  Park's  most  significant  contribution  of  the  past  year 
was  his  work  with  Aggarwal  on  monotone  arrays. 

Cindy Phillips and Charles Leiserson extended the graph contraction results of Phillips to obtain
$O(\lg n \, \lg^2 \gamma)$-time randomized algorithms for finding the connected components (and related problems)
of n-node bounded-degree graphs, where $\gamma$ is the maximum genus of any connected component. With Guy
Blelloch  and  (from  Thinking  Machines  Corp)  Ajit  Agrawal  and  Robert  Krawitz,  Phillips  investigated  primitives 
for  efficiently  manipulating  dense  matrices  in  massively  parallel  hypercube  architectures  where  many  matrix 
elements  must  be  mapped  to  a  single  processor. 

Serge A. Plotkin finished his Ph.D. thesis entitled Graph-Theoretic Techniques for Parallel, Distributed,
and  Sequential  Computation.  His  thesis  includes  the  following  results: 

•  A  novel  algorithm  for  symmetry  breaking  in  distributed  and  parallel  computing  environments  that  runs 
in $O(\lg^* n)$ time.

•  A  new  atomic  data  object,  called  a  Sticky  Bit.  A  polynomial  number  of  Sticky  Bits  are  sufficient  to 
convert a safe implementation of an arbitrary sequential object into an atomic one in a shared-memory
multiprocessing  environment. 

•  An  algorithm  for  managing  a  global  resource  in  a  distributed  network.  In  particular,  the  algorithm 
allows a resource used by a protocol of n processors to be managed with only amortized $O(\lg^2 n)$ message
overhead. 

•  A parallel algorithm for solving the minimum spanning tree problem on an n-by-n mesh-connected
computer that runs in $O(n)$ time. The algorithm is novel because it is based on reducing the minimum
spanning  tree  problem  to  the  problem  of  finding  shortest  paths. 

•  The first sublinear-time parallel algorithm for bipartite matching. The algorithm runs in
$O(n^{2/3} \lg^3 n)$ time on a graph of n vertices, and can be generalized to solve 0-1 flow problems,
including both weighted and unweighted versions.

•  Two  sequential  algorithms  for  the  generalized  circulation  problem  (network  flow  with  losses  and 
gains)  which  are  the  first  polynomial-time  combinatorial  algorithms  for  this  problem.  One  algorithm  runs  in 
$O(n^2 m^2 \lg^2 n \lg B)$ time and the other runs in $O(n^2 m^2 \lg n \lg^2 B)$ time, where n is the number of
nodes,  m  is  the  number  of  edges,  and  B  is  the  largest  integer  used  to  represent  capacities  and  gains,  where 
gains are represented as ratios of integers.
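
To give a feel for where an iterated-logarithm running time comes from in symmetry breaking, the sketch below implements the classical deterministic coin-tossing color reduction of Cole and Vishkin on a directed ring. It is an illustration of the general technique only, not the algorithm from the thesis, and the function names are assumptions. Each round shrinks a palette of L-bit colors to at most 2L colors, so the number of rounds grows like lg* n.

```python
def reduce_colors_once(colors):
    """One synchronous round on a directed ring with a proper coloring.

    Each node compares its color with its successor's, finds the lowest
    bit position where they differ, and adopts (position, own bit) as
    its new color.
    """
    n = len(colors)
    new = []
    for v in range(n):
        mine, succ = colors[v], colors[(v + 1) % n]
        diff = mine ^ succ
        i = (diff & -diff).bit_length() - 1    # lowest differing bit position
        new.append(2 * i + ((mine >> i) & 1))
    return new

def six_color_ring(ids):
    """Reduce distinct node ids to a proper coloring with colors 0..5."""
    colors, rounds = list(ids), 0
    while max(colors) >= 6:
        colors = reduce_colors_once(colors)
        rounds += 1
    return colors, rounds

ids = [(i * 37) % 101 for i in range(101)]     # 101 distinct ids on a ring
colors, rounds = six_color_ring(ids)
print(rounds, sorted(set(colors)))
```

Every node uses only its own color and its successor's, so all nodes can perform a round simultaneously without any central coordination.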

Plotkin  has  assumed  a  postdoctoral  position  at  Stanford  University. 

Mark  Newman’s  interests  include  fault  tolerant  parallel  computation  and  efficient  procedures  for 
simulating  one  parallel  network  with  another.  During  the  past  year,  he  completed  work  with  Leighton  and 
Johan  Hastad  which  showed  how  a  hypercube  with  a  large  number  of  faulty  processors  and  communication 
paths  could  be  used  for  computation.  They  showed  that,  even  if  a  constant  fraction  of  the  hypercube’s 
components  fail,  the  cube  can  simulate  a  fully  functioning  hypercube  using  only  a  constant  factor  more  time.  In 
the  next  year,  Newman  plans  to  extend  the  results  for  faulty  hypercubes  to  other  networks  and  to  search  for 
efficient  graph  embeddings  which  aid  in  network  simulation. 

SYSTEMS SOFTWARE

Our studies of the scalability of large-scale shared-memory multiprocessors focus on the use of locality in
various forms to reduce the latency of memory accesses. A major part of the work, headed by Prof. Anant
Agarwal, has also focused on developing better data collection and evaluation techniques for multiprocessors.

The  data  from  several  address  tracing  techniques  that  we  developed  for  both  symbolic  and  numeric 
computing  showed  that  parallel  programs  exhibit  a  significant  amount  of  locality,  and  that  this  locality  could  be 
successfully  exploited  by  caches  at  the  processor  level  to  provide  a  high  effective  memory  bandwidth  to  the 
processor.  An  evaluation  of  the  large-scale  interconnection  network  performance  of  both  hardware  cache 
coherence  (based  on  a  novel  directory  structure)  and  software  coherence  schemes  showed  that  the  hardware 
directory  scheme  could  perform  well  under  significant  sharing  levels,  while  the  software  schemes  could  be  relied 
upon  for  low  to  moderate  sharing  levels. 

Our future research will focus on two aspects. The first is high-performance interconnection
technology for large-scale multiprocessing. We are building a prototype network clocked at 100 MHz that will
provide an average memory access time of less than 200 ns for a 256-processor system. We are also investigating
how locality of addressing can be incorporated into the network and to what extent programs can exploit the
locality in the network. The second aspect is novel synchronization techniques, such as barriers and
semaphores with a back-off capability, which reduce network traffic by minimizing unnecessary
spins on the network, together with a detailed design of the directory structure required to maintain cache
coherence on a large scale. We will also continue work on network and cache evaluation techniques and
multiprocessor data collection efforts.
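
The back-off idea can be sketched in software as follows. This is a minimal illustration using Python threads, in which a non-blocking lock attempt stands in for one probe of a shared synchronization variable across the network. The class name, delay constants, and the choice of randomized exponential back-off are all assumptions, not details of the planned design.

```python
import random
import threading
import time

class BackoffLock:
    """Test-and-set style lock that waits longer after each failed attempt."""

    def __init__(self, base_delay=1e-6, max_delay=1e-3):
        self._flag = threading.Lock()   # stands in for a test-and-set word
        self.base_delay = base_delay
        self.max_delay = max_delay

    def acquire(self):
        delay = self.base_delay
        while not self._flag.acquire(blocking=False):   # one probe of the flag
            time.sleep(random.uniform(0, delay))        # back off before retrying
            delay = min(2 * delay, self.max_delay)      # exponential growth, capped

    def release(self):
        self._flag.release()

lock = BackoffLock()
counter = 0

def worker(iterations):
    global counter
    for _ in range(iterations):
        lock.acquire()
        counter += 1            # critical section
        lock.release()

threads = [threading.Thread(target=worker, args=(1000,)) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(counter)
```

A waiting processor that sleeps between probes generates far fewer remote accesses than one that spins continuously, at the cost of a possibly later acquisition.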

ALGORITHMS

Prof.  Leighton  and  his  students  have  discovered  a  very  efficient  randomized  algorithm  for  routing  in 
such  networks  as  the  hypercube,  butterfly  and  shuffle-exchange  graph  that  is  robust  in  the  sense  that  the  same 
algorithm  works  for  virtually  any  network  in  near-optimal  time  (e.g.,  even  in  arrays). 

They  have  also  discovered  an  entire  class  of  approximation  algorithms  for  layout  related  problems  in 
VLSI  such  as  graph  partitioning,  crossing  number  and  layout  area. 

In  addition,  they  have  discovered  efficient  embeddings  for  a  variety  of  useful  networks  in  the  hypercube 
and  butterfly.  Such  embeddings  are  useful  for  mapping  processes  to  processors  in  both  synchronous  and 
asynchronous  parallel  machines. 

Michelangelo  Grigni  drafted  “Tight  Bounds  on  Minimum  Broadcast  Networks”  with  David  Peleg  of  the 
Weizmann  Institute  (previously  Stanford). 

A certain class of recursively structured graphs had been proposed as examples of graphs which
required small wire area, but large chip area, to lay out. Mark Hansen disproved this conjecture and
demonstrated that this class of graphs has chip layout area equal to wire area. He has also developed some
techniques  for  proving  lower  bounds  on  the  area  required  to  embed  rectangular  grids  in  square  grids. 

Richard  Koch  has  probabilistically  analyzed  a  routing  scheme  which  has  been  implemented  on  parallel 
architectures  based  on  the  butterfly  graph. 

Dina  Kravets  developed  algorithms  for  finding  all  farthest  neighbors  of  every  vertex  on  a  convex  n-gon  in 
$\Theta(n)$ time, for sorting every row of a monotone matrix in $\Theta(n^2)$ time, and for sorting a set of numbers given
ranges of ranks in $\Theta(n \log (Q/n))$ time, where Q is the sum of the ranges.

Satish  Rao  and  Tom  Leighton  have  found  the  first  approximation  algorithms  for  the  problems  of  finding 
small  graph  separators,  VLSI  layout  and  crossing  number.  Leighton,  Maggs,  and  Rao  have  explored  solutions 
to  packet  routing  problems  with  fixed  congestion  and  dilation  in  LMR.  They  show  the  existence  of  a  constant 
overhead schedule for such problems.

Eric  Schwabe  proved  a  general  lower  bound  showing  that  any  bounded-degree  network  which  can 
manage m local priority-queue memories must have total size $\Omega(m \log m)$, even if randomized algorithms
are allowed. This lower bound can be achieved: there is a network and algorithm which can manage m
such memories in $O(m \log m)$ total space. As a side result of the techniques used in this algorithm, Hansen
developed a simple algorithm for permutation routing of n messages on a butterfly network deterministically
and on-line in $\Theta((\log^2 n)/(\log \log n))$ steps.

APPLICATIONS

Profs.  Jacob  White  and  Srinivas  Devadas  and  their  students  are  investigating  difficult  simulation  tasks  in 
an  effort  to  challenge  the  capabilities  of  the  American  Resource  Computer. 

In  this  past  year  we  proved  a  result  about  the  optimality  of  Gauss-Jacobi  over  Gauss-Seidel  on  parallel 
processors,  and  developed  a  banded  Gauss-Jacobi  relaxation  approach  to  simulating  circuits  that  is  fast  and 
reliable.  In  addition,  we  proved  several  new  results  about  the  uniformity  of  WR  convergence  for  nonlinear 
diagonally  dominant  systems,  and  demonstrated  the  result’s  practical  implications  on  one  dimensional  MOS 
device  simulation.  We  also  reformulated  the  capacitance  extraction  problem  into  an  iterative  algorithm  whose 
steps  involve  a  potential  field  from  point  charge  calculation,  for  which  order  N  log  N  approximate 
algorithms exist. This implies that it is possible to reduce the complexity of the capacitance extraction problem
from the commonly used $N^3$ approach to $N \log N$. Finally, we found several new approaches for the
detailed  simulation  of  switching  filter  circuits,  and  in  particular  implemented  a  new  method  for  distortion 
calculation  of  switched  capacitor  filters. 
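
The difference between the two relaxation schemes mentioned above can be seen in a few lines. In the sketch below, a Gauss-Jacobi sweep updates every unknown from the previous iterate, so all rows can be processed at once, while a Gauss-Seidel sweep uses each new value immediately and is therefore sequential within a sweep. The small diagonally dominant system and the function names are illustrative assumptions; the sketch does not reproduce the optimality result or the banded method.

```python
def jacobi_sweep(A, b, x):
    """Update all unknowns from the previous iterate (rows are independent)."""
    n = len(b)
    return [(b[i] - sum(A[i][j] * x[j] for j in range(n) if j != i)) / A[i][i]
            for i in range(n)]

def seidel_sweep(A, b, x):
    """Update unknowns in order, using the newest values as they appear."""
    x = list(x)
    n = len(b)
    for i in range(n):
        x[i] = (b[i] - sum(A[i][j] * x[j] for j in range(n) if j != i)) / A[i][i]
    return x

def solve(sweep, A, b, tol=1e-10, max_sweeps=1000):
    """Repeat sweeps until successive iterates agree to within tol."""
    x = [0.0] * len(b)
    for k in range(1, max_sweeps + 1):
        new = sweep(A, b, x)
        if max(abs(u - v) for u, v in zip(new, x)) < tol:
            return new, k
        x = new
    raise RuntimeError("did not converge")

A = [[4.0, -1.0, 0.0], [-1.0, 4.0, -1.0], [0.0, -1.0, 4.0]]
b = [2.0, 4.0, 10.0]            # exact solution is (1, 2, 3)
xj, sweeps_j = solve(jacobi_sweep, A, b)
xs, sweeps_s = solve(seidel_sweep, A, b)
print(sweeps_j, sweeps_s)
```

On this example Gauss-Seidel needs fewer sweeps, but each Gauss-Jacobi sweep exposes one independent task per unknown, which is the trade-off that matters on a parallel machine.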

In  the  immediate  future,  we  will  continue  the  investigation  of  the  capacitance  calculation  problem,  and 
will  also  try  to  apply  the  N  log  N  approach  to  the  calculation  of  inductances.  In  the  area  of  device 
simulation,  we  will  be  working  on  numerical  techniques  for  solving  the  hydrodynamics-based  MOS  device 
equations,  parallel  iterative  techniques  for  two  and  three  dimensional  device  simulation,  and  parallel  WR  for 
mixed device-circuit simulation. In the area of circuit simulation, we are investigating parallel nonlinear
multigrid-like techniques for the simulation of analog arrays, investigating parallel exponential fitting discretization
schemes,  and  trying  to  extend  the  approach  to  simulation  of  clocked  analog  circuits  to  phase-locked  loops. 

A MEMORY DESIGN FOR THE MESSAGE-DRIVEN PROCESSOR


Soha  M.  N.  Hassoun 


Abstract 

The  Message-Driven  Processor  (MDP)  is  a  low-latency  processing  node  for  a  scalable 
fine-grain  MIMD  concurrent  computer,  the  Jellybean  Machine.  Programs  are  executed 
by  passing  messages  through  a  low-latency  network.  Each  MDP  integrates  a 
processor,  a  memory,  and  a  communication  network.  On  top  of  this  message-passing 
model,  the  MDP  supports  a  global  virtual  address  space. 

This  thesis  involves  the  design  and  implementation  of  a  memory  for  the  Message-Driven 
Processor.  The  memory  array  can  be  accessed  by  index,  by  row,  or  as  a  set-associative 
cache.  Index  operations  are  used  to  read  and  write  memory.  Row  operations  reduce 
the latency in message-handling by providing special-purpose Row Buffers that
access four words (a row) of memory simultaneously. Two Queue Row Buffers enable
buffering  messages  at  two  different  priority  levels  as  soon  as  they  arrive  from  the 
network.  An  Instruction  Row  Buffer  acts  as  a  small  instruction  cache.  Set-associative 
operations  provide  a  translation  mechanism  to  enable  translating  any  object  to  its 
associated  item.  MDP  operating  system  routines  use  this  cache  to  translate  virtual 
identifiers  into  global  addresses. 
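
The set-associative translation described above can be pictured with a small software model. The sketch below is purely illustrative: the number of sets and ways, the use of the low-order key bits to select a set, and the round-robin replacement are all assumptions, since the abstract does not specify them.

```python
class SetAssociativeTable:
    """Two-way set-associative key -> value table with round-robin replacement."""

    def __init__(self, num_sets=4, ways=2):
        self.num_sets = num_sets
        self.ways = ways
        self.sets = [[None] * ways for _ in range(num_sets)]   # (key, value) pairs
        self.victim = [0] * num_sets    # next way to overwrite in each set

    def _index(self, key):
        return key % self.num_sets      # low-order part of the key selects the set

    def enter(self, key, value):
        s = self._index(key)
        for w, entry in enumerate(self.sets[s]):
            if entry is not None and entry[0] == key:
                self.sets[s][w] = (key, value)     # update in place
                return
        self.sets[s][self.victim[s]] = (key, value)
        self.victim[s] = (self.victim[s] + 1) % self.ways

    def translate(self, key):
        for entry in self.sets[self._index(key)]:  # compare every way in the set
            if entry is not None and entry[0] == key:
                return entry[1]
        return None     # miss: the caller falls back to a slower lookup

table = SetAssociativeTable(num_sets=4, ways=2)
table.enter(5, 0x100)
table.enter(9, 0x200)
table.enter(13, 0x300)      # keys 5, 9 and 13 share a set, so key 5 is evicted
print(table.translate(5), table.translate(9), table.translate(13))
```

In hardware the comparisons within a set happen in parallel; the loop over ways here only models that behavior.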

The microarchitecture and the circuit design of the memory are developed. A test chip is
fabricated  to  verify  the  design.  Evaluation  of  the  row  operations  is  presented. 

DESIGN OF A NETWORK FOR CONCURRENT MESSAGE PASSING SYSTEMS


Paul  Y.  Song 


Abstract

We describe the design of the network design frame (NDF), a self-timed routing chip for a message-passing
concurrent computer. The NDF uses a partitioned data path, low-voltage output drivers, and a
distributed token-passing arbiter to provide a bandwidth of 450 Mbits/sec into the network. Wormhole
routing and bidirectional virtual channels are used to provide low-latency communications: less than 2 µs
latency to deliver a 216-bit message across the diameter of a 1K-node mesh-connected machine. To
support  concurrent  software  systems,  the  NDF  provides  two  logical  networks,  one  for  user  messages  and 
one  for  system  messages.  The  two  networks  share  the  same  set  of  physical  wires.  To  facilitate  the 
development  of  network  nodes,  the  NDF  is  a  design  frame.  The  NDF  circuitry  is  integrated  into  the  pad 
frame  of  a  chip  leaving  the  center  of  the  chip  uncommitted. 

We  define  an  analytic  framework  in  which  to  study  the  effects  of  network  size,  network  buffering  capacity, 
bidirectional  channels,  and  traffic  on  this  class  of  networks.  The  response  of  the  network  to  various 
combinations of these parameters is obtained through extensive simulation of the network model.
Through  simulation,  we  are  able  to  observe  the  macro  behavior  of  the  network  as  opposed  to  the  micro 
behavior  of  the  NDF  routing  controller. 

We  subsequently  define  the  limitations  of  the  network  and  propose  recommendations  for  enhancing  the 
network  performance.  The  limitation  of  the  network  arises  from  contention  for  the  switching  elements  of 
the  NDF.  The  use  of  virtual  channels  allows  better  utilization  of  network  bandwidth  by  doubling  the  number 
of  switches  at  each  node.  A  three  dimensional  version  of  the  NDF  will  be  needed  to  support  large 
machines  that  exceed  16  nodes  in  a  dimension.  Adding  a  third  dimension  increases  the  bisection  width  of 
the network and gives us more throughput.
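
The effect of a third dimension on bisection width can be checked with simple counting. The sketch below assumes a mesh without wraparound links and counts each link between neighboring nodes once; the 4096-node machine is a hypothetical example chosen so that the three-dimensional version has exactly 16 nodes in a dimension.

```python
def mesh_bisection_channels(k, n):
    """Links cut when a k-ary n-dimensional mesh (no wraparound) is halved.

    Cutting across one dimension severs one link per node in the
    (n-1)-dimensional cross-section, i.e. k**(n-1) links (k even).
    """
    return k ** (n - 1)

nodes = 4096
for k, n in [(64, 2), (16, 3)]:
    assert k ** n == nodes
    print(f"{k}-ary {n}-D mesh: {mesh_bisection_channels(k, n)} links across the bisection")
```

For the same number of nodes, the three-dimensional arrangement has four times as many links crossing the middle of the machine, which is the source of the additional throughput.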

A MIXED FREQUENCY-TIME APPROACH FOR FINDING THE STEADY-STATE
SOLUTION  OF  CLOCKED  ANALOG  CIRCUITS 


K.  Kundert,  J.  White,  and  A.  Sangiovanni-Vincentelli 


Abstract 

Performing detailed simulation of clocked analog circuits (e.g., switched-capacitor filters
and  switching  power  supplies)  with  circuit  simulation  programs  like  SPICE  is 
computationally  very  expensive.  In  this  paper  we  present  a  new,  more  efficient,  method 
for  computing  the  detailed  steady-state  solution  of  clocked  analog  circuits.  The  method 
exploits  the  property  of  such  circuits  that  the  waveforms  in  each  clock  cycle  are  similar 
but not exact duplicates of the preceding or following cycles. Therefore, by computing
accurately  a  few  selected  cycles,  the  entire  steady-state  solution  can  be  constructed 
efficiently. 


Microsystems Research Center, Room 39-321
Massachusetts Institute of Technology
Cambridge, Massachusetts 02139
Telephone (617) 253-3138


VLSI  Memo  No.  88^49 
June  1988 


THE RECONFIGURABLE ARITHMETIC PROCESSOR


Stuart  Fiske  and  William  J.  Dally 


Abstract 

The Reconfigurable Arithmetic Processor (RAP) is an arithmetic processing node for a
message-passing MIMD concurrent computer. It incorporates on one chip several
serial, 64-bit floating-point arithmetic units connected by a switching network. By
sequencing the switch through different patterns, the RAP chip calculates complete
arithmetic formulas. By chaining together its arithmetic units, the RAP reduces the
amount of off-chip data transfer. In the examples we have simulated, off-chip I/O can
often be reduced to 30% or 40% of that required by a conventional arithmetic chip.
Simulations predict a peak performance of 20 MFLOPS with 800 Mbit/sec off-chip
bandwidth in a 2 µm CMOS process.

REDUCING THE PARALLEL SOLUTION TIME OF SPARSE CIRCUIT MATRICES
USING  REORDERED  GAUSSIAN  ELIMINATION  AND  RELAXATION 


David  Smart  and  Jacob  White 


Abstract 

Using  parallel  processors  to  reduce  the  execution  times  of  classical  circuit  simulation 
programs  like  SPICE  and  ASTAP  has  been  the  focus  of  much  current  research.  In  these 
efforts,  good  parallel  speed  increases  have  been  achieved  for  linearized  system 
construction,  but  it  has  been  difficult  to  get  good  parallel  speed  increases  for  sparse 
matrix  solution.  In  this  paper  we  examine  two  approaches  for  reducing  parallel  sparse 
matrix  solution  time;  the  first  based  on  pivot  ordering  algorithms  for  Gaussian 
elimination,  and  the  second  based  on  relaxation  algorithms.  In  the  section  on  Gaussian 
elimination  sparse  matrix  solution,  we  present  a  pivot  ordering  algorithm  which  increases 
the  parallelism  of  Gaussian  elimination  compared  to  the  commonly  used  Markowitz 
method.  The  performance  of  the  new  algorithm  is  compared  to  other  suggested 
ordering  algorithms  for  a  collection  of  circuit  examples.  The  minimum  number  of  parallel 
steps  for  the  solution  of  a  tridiagonal  matrix  is  derived,  and  it  is  shown  that  this  optimum 
is  nearly  achieved  by  the  ordering  heuristics  which  attempt  to  maximize  parallelism.  In 
the  section  on  relaxation,  we  present  an  optimality  result  about  Gauss-Jacobi  over 
Gauss-Seidel  relaxation  on  parallel  processors. 

WAVEFORM RELAXATION APPLIED TO TRANSIENT DEVICE SIMULATION


M.  Reichelt,  J.  White,  J.  Allen,  and  F.  Odeh 


Abstract 

In  this  paper  we  investigate  the  possibility  of  accelerating  the  transient  simulation  of  MOS 
devices  by  using  waveform  relaxation.  Standard  spatial  discretization  techniques  are 
used  to  generate  a  large,  sparsely-connected  system  of  algebraic  and  ordinary 
differential  equations  in  time.  The  waveform  relaxation  (WR)  algorithm  for  solving  such  a 
system  is  described,  and  several  theoretical  results  that  characterize  the  convergence  of 
WR  for  device  simulation  are  given.  In  addition,  one-dimensional  experimental  results 
are  presented. 

1  Introduction 

Both digital and analog MOS circuit designers rely heavily on circuit simulation
programs like SPICE [3] to ensure the correctness and to test the performance of
their designs. For most applications, the lumped MOS models used in these programs
[9] accurately reflect the behavior of terminal currents and charges, but in
some cases, these models are not adequate. In particular, charge redistribution
between source and drain during device switching cannot easily be modeled by
a lumped device, but the details of this charge redistribution can have an important
effect on circuit behavior. In circuits like dynamic memory cells, sense
amplifiers, analog-to-digital converters, and high-frequency operational amplifiers,
charge redistribution effects may not only degrade performance, but can
inhibit proper function.

For these critical applications, sufficiently accurate transient simulations can
be performed if, instead of using a lumped model for each transistor, the transistor
terminal currents and charges are computed by numerically solving the
drift-diffusion-based partial differential equation approximation for electron transport
in the device. However, simulating even a few-transistor circuit in this way is
very computationally expensive, because the accurate solution of the transport
equations in an MOS device requires a two-dimensional mesh with more
than a thousand points.

In this paper we investigate the possibility of accelerating the transient
simulation of MOS devices by using waveform relaxation. In the next section we
start by introducing the equations for transient device simulation. Then we view
the result of applying commonly used spatial discretization techniques to these
equations, generating a large, sparsely-connected system consisting of algebraic
and ordinary differential equations in time. In Section 3 we present the waveform
relaxation algorithm for solving such a system, and suggest why it may be
particularly efficient. Several theoretical results that characterize the convergence
of the method are presented in Section 4, and one-dimensional experimental
results are described in Section 5. Finally, conclusions and acknowledgements
are given in Section 6.

2  Classical  Simulation  Equations 

The terminal behavior of an MOS device is well described by the Poisson
equation and the electron current-continuity equation [5]

$$\epsilon \nabla^2 \psi + q (N - n) = 0 \qquad (1)$$

$$\nabla \cdot J_n - q \frac{\partial n}{\partial t} = 0 \qquad (2)$$

In these equations $\psi$ is the electrostatic potential, $q$ is the magnitude of
electronic charge, $n$ is the electron concentration, and $J_n$ is the electron current
density. $N$ is the net doping concentration given by $N = N_D - N_A$, where $N_D$
and $N_A$ are the donor and acceptor concentrations.

The electron current density is commonly approximated by the drift-diffusion
equation:

$$J_n = -q \left( \mu_n n \nabla \psi - D_n \nabla n \right) \qquad (3)$$

where $\mu_n$ is the electron mobility, and $D_n$ is the diffusion coefficient. An
equation system with only $n$ and $\psi$ as unknowns is derived by using (3) to eliminate
$J_n$ from (2).
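
The intermediate form is not printed in the source; as a sketch, carrying out the substitution of (3) into (2) and dividing through by $q$ gives a single transport equation for $n$, which is coupled to $\psi$ through (1):

```latex
\nabla \cdot \bigl[ -q \left( \mu_n n \nabla \psi - D_n \nabla n \right) \bigr]
  - q \frac{\partial n}{\partial t} = 0
\quad \Longrightarrow \quad
\frac{\partial n}{\partial t}
  = \nabla \cdot \left( D_n \nabla n - \mu_n n \nabla \psi \right)
```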

There are a variety of ways to spatially discretize the system of two
equations in the two unknowns $n$ and $\psi$. Given a rectangular two-dimensional mesh,
a common approach is to use a finite-difference formula for the Poisson
equation, and an exponentially-fit finite-difference formula for the current-continuity
equation. For notational simplicity, we will assume that the mesh points are
evenly spaced a distance $l$ apart, so that the discretized Poisson equation at
each mesh point $i$ is:

$$\epsilon \sum_j (\psi_j - \psi_i) + q l^2 (N_i - n_i) = 0 \qquad (4)$$

where $n_i$, $\psi_i$, and $N_i$ are the electron concentration, the potential, and the net
doping concentration at mesh point $i$. The summation is taken over the nodes
$j$ surrounding $i$ (four nodes for a mesh node $i$ not on the boundary, i.e. north,
south, east, and west).

Under the same assumptions, and assuming constant mobility, the discretized
current-continuity equation with the drift-diffusion approximation becomes:

$$q D_n \sum_j \left[ B(u_j - u_i)\, n_j - B(u_i - u_j)\, n_i \right] - q l^2 \frac{d}{dt} n_i = 0 \qquad (5)$$

where $u_i = q \psi_i / KT$ and $B(x) = x / (\exp x - 1)$ is the Bernoulli function used
to exponentially fit the potential variation to the electron concentration
variation. In this equation, the Einstein relation $D_n = (KT/q) \mu_n$ has been used to
eliminate $\mu_n$.
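
As a minimal sketch of how the bracketed sum in (5) might be evaluated, the following assumes a one-dimensional mesh (two neighbours per interior point). The function names `bernoulli` and `continuity_flux_sum` are illustrative and do not come from the paper.

```python
import math

def bernoulli(x: float) -> float:
    """B(x) = x / (exp(x) - 1), with the removable singularity at x = 0 filled in."""
    if abs(x) < 1e-8:
        # Series B(x) = 1 - x/2 + x^2/12 - ...; two terms suffice this close to 0.
        return 1.0 - 0.5 * x
    return x / math.expm1(x)

def continuity_flux_sum(u, n, i):
    """Sum over the two 1-D neighbours j of B(u_j - u_i) n_j - B(u_i - u_j) n_i.

    u: normalized potentials u_i = q psi_i / KT, n: electron concentrations.
    Assumes i is an interior index, so both neighbours exist.
    """
    total = 0.0
    for j in (i - 1, i + 1):
        total += bernoulli(u[j] - u[i]) * n[j] - bernoulli(u[i] - u[j]) * n[i]
    return total
```

For a flat potential every Bernoulli factor equals 1 and the sum reduces to the ordinary second difference of $n$; for $n_i = \exp(u_i)$ the sum vanishes, which follows from the identity $B(-x) = B(x) + x$.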

If there are $m$ mesh points, then the result of applying the spatial
discretization to (1), (2), and (3) is a sparse system of $m$ algebraic constraints, represented
by (4), and a sparsely connected system of $m$ ordinary differential equations,
represented by (5).

3  The  Waveform  Relaxation  Process 

The standard approach used to solve these two systems is to discretize the
$\frac{d}{dt} n_i(t)$ term in (5) with a low order integration method such as backward-Euler
[1]. The result is a sequence of algebraic systems in $2m$ unknowns, each of which
can be solved with some variant of Newton’s method and/or relaxation. Another
approach is to apply relaxation directly to the differential equation system. This
leads to a time waveform relaxation process, as given by the following algorithm.

Although only the Gauss-Jacobi algorithm is presented for the sake of
notational simplicity, a Gauss-Seidel version could be created by adjusting the
iteration indexes.

The WR algorithm reduces the problem of simultaneously solving $m$
differential equations and $m$ algebraic equations to one of iteratively solving $2m$
independent equations. Each of the $m$ differential equations for the $n_i(t)$ waveforms
can be solved with a numerical integration method such as backward-Euler.
Since they only contribute algebraic constraints, the equations for calculating
the $\psi_i(t)$ waveforms need to be solved only at the discrete points in time used
to calculate the $n_i(t)$ waveforms.
Algorithm 1  WR Gauss-Jacobi Algorithm for solving the system
produced by equations (4) and (5).

The superscript $k$ denotes the iteration count, the subscript
$i$ denotes the component index of a vector, and $\epsilon_\psi$ and $\epsilon_n$
are small positive numbers.

k ← 0
repeat {
    k ← k + 1
    foreach (i ∈ {1, ..., m}) {
        solve
            $\epsilon \sum_j (\psi_j^{k-1} - \psi_i^{k}) + q l^2 (N_i - n_i^{k-1}) = 0$
            $q D_n \sum_j \left[ B(u_j^{k-1} - u_i^{k-1})\, n_j^{k-1} - B(u_i^{k-1} - u_j^{k-1})\, n_i^{k} \right] - q l^2 \frac{d}{dt} n_i^{k} = 0$
        for ($\psi_i^k(t)$, $n_i^k(t)$; $t \in [0, T]$, $n_i^k(0) = n_{i0}$)
    }
} until ($\|\psi^k - \psi^{k-1}\| < \epsilon_\psi$ and $\|n^k - n^{k-1}\| < \epsilon_n$)
The inherent advantage of the WR approach is that the differential equations
are solved in a decomposed fashion, and therefore different sets of timesteps can
be used at different mesh points to calculate the time evolution of the electron
concentration. The method exploits multi-rate behavior. In MOS devices, the
rate at which electron concentrations evolve may be very different in the channel
compared to the source or the drain. Therefore, WR may prove to be efficient
for the device simulation problem, provided it converges, and doesn’t take too
many iterations. This is the subject of the next section.
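
The Gauss-Jacobi waveform iteration can be sketched on a much simpler problem than the device equations. The example below is an assumption-laden toy, not the authors' simulator: it takes a linear chain $\dot{n}_i = n_{i-1} - 2 n_i + n_{i+1}$ with zero values beyond both ends, integrates each scalar equation with backward Euler on a shared uniform grid, and feeds each equation the neighbouring waveforms from the previous iteration. The name `wr_gauss_jacobi` and its parameters are hypothetical.

```python
def wr_gauss_jacobi(n0, h, steps, tol=1e-10, max_iters=200):
    """Gauss-Jacobi waveform relaxation for the toy chain
    dn_i/dt = n_{i-1} - 2 n_i + n_{i+1}, with zeros beyond both ends.

    Returns (waveforms, iterations); waveforms[i][s] approximates n_i(s * h).
    """
    m = len(n0)
    zero = [0.0] * (steps + 1)
    # Initial guess: hold every waveform at its initial value, so the guess
    # matches the initial conditions.
    waves = [[n0[i]] * (steps + 1) for i in range(m)]
    for k in range(1, max_iters + 1):
        new = []
        for i in range(m):
            # Neighbouring waveforms come from the previous iteration (Jacobi).
            left = waves[i - 1] if i > 0 else zero
            right = waves[i + 1] if i < m - 1 else zero
            w = [n0[i]]
            for s in range(steps):
                # Backward Euler: (w_new - w_old) / h = -2 w_new + left + right.
                w.append((w[-1] + h * (left[s + 1] + right[s + 1])) / (1.0 + 2.0 * h))
            new.append(w)
        change = max(abs(a - b) for wn, wo in zip(new, waves) for a, b in zip(wn, wo))
        waves = new
        if change < tol:
            return waves, k
    raise RuntimeError("waveform relaxation did not converge")
```

Because each inner loop touches only one waveform, the per-equation integrations are independent of one another within an iteration, which is the property that allows different timesteps at different mesh points.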


4  Theoretical  Results 

As is usually the case for waveform relaxation algorithms applied to systems of
differential equations, Algorithm 1 converges to the solution of the
differential-algebraic system for any initial guess that matches the initial conditions. The
precise statement is given in the following theorem.

Theorem 1 Given a finite interval $[0, T]$, and any initial guess $n^0(t)$ and $\psi^0(t)$,
$t \in [0, T]$, such that $n^0(0) = n_0$, the sequence of waveforms produced by Alg. 1
converges to the exact solution of the system given by equations (4) and (5).

The proof of the above theorem follows the same steps as the Picard-like
proofs of waveform relaxation for ordinary differential equations [10]. First the
equations that describe the difference between one iteration and the next are
organized into the form

$$\delta\psi^{k+1}(t) = A\, \delta\psi^{k}(t) + B\, \delta n^{k}(t) \qquad (6)$$

and

$$\delta n^{k+1}(t) = \int_0^t \left[ f(n^{k+1}(\tau), n^{k}(\tau), \psi^{k}(\tau)) - f(n^{k}(\tau), n^{k-1}(\tau), \psi^{k-1}(\tau)) \right] d\tau \qquad (7)$$

where $\delta\psi^k = \psi^k - \psi^{k-1}$ and $\delta n^k = n^k - n^{k-1}$. The matrices $A, B \in \Re^{m \times m}$ and
the function $f$, mapping into $\Re^m$, are constructed from the iteration equations
in Alg. 1. The next step is to show that (6) and (7) represent a contraction.
To this end, consider an interval of time short enough to ensure equation (7)
represents a contraction with respect to $n$ for a fixed $\psi$. That (6) is a contraction
with respect to $\psi$ for a fixed $n$ is well-known [8], as (6) represents relaxation
applied to the Poisson equation. One can fit the two contractions together to
show that relaxation applied to the coupled system converges.

The  above  proof  outline  suggests  that  the  WR  algorithm  converges  in  a 
nonuniform  manner.  That  is,  first  convergence  is  achieved  over  a  short  time 
interval,  set  by  what  is  needed  to  make  (7)  a  contraction,  then  over  the  next 
short  time  interval,  and  then  the  next,  continuing  slowly,  until  the  convergence 
is  achieved  throughout  an  entire  interval  of  interest.  When  applied  to  general 
differential equation systems, like circuits, WR does demonstrate this
nonuniformity in the convergence [7], but WR does not usually show nonuniformity
when applied to the transient device simulation problem.

In order to analyze why this is the case, we will consider a model problem of
just the differential equation associated with the electron concentration, $n$, and
assume that the potential $\psi$ is known. The WR iteration update equation for
this case is then

$$D_n \sum_j \left[ B(u_j - u_i)\, n_j^{k} - B(u_i - u_j)\, n_i^{k+1} \right] - l^2 \frac{d}{dt} n_i^{k+1} = 0 \qquad (8)$$

for each $i \in \{1, \ldots, m\}$. Note that given $\psi$, (8) is a linear time-varying differential
equation in $n$. For this problem we have the following theorem:

Theorem 2 If at each time $t$, $\psi(t)$ is such that the electric field along any
vertical or horizontal line is either constant, or monotonically increasing, then
(8) is a contraction in a uniform norm on any finite interval $[0, T]$. That is,

$$\max_{[0,T]} \|\delta n^{k+1}(t)\| < \gamma \max_{[0,T]} \|\delta n^{k}(t)\| \qquad (9)$$

where $\gamma < 1$.

The  proof  of  Theorem  2  is  given  in  the  appendix. 

Since allowing the different differential equations to take very different timesteps
is WR’s main advantage, if this property had to be restricted to ensure convergence, the
WR algorithm would not be effective. Fortunately, the fact that the WR algorithm is a
contraction in a uniform norm on any interval implies that the timesteps used
to numerically integrate the differential equations are almost unconstrained.
Given that the different differential equations use different timesteps,
interpolation must be used to communicate results between equations, and if not done
carefully this can cause nonconvergence. Linear interpolation is certain not to
cause problems, and therefore we have the following theorem [7]:

Theorem 3 Let each of the $m$ independent WR iteration update equations
given in (8) be solved numerically with backward-Euler, with $m$ different sets of
timesteps. In addition, assume that linear interpolation is used to derive values
for the $n_j$'s between time discretization points. Then this multirate discretized
WR algorithm for (8) converges, regardless of the timestep selections.
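
The setting of Theorem 3 can be illustrated with a small sketch. It assumes a two-variable toy system $\dot{x}_1 = -2 x_1 + x_2$, $\dot{x}_2 = x_1 - 2 x_2$ (chosen to be diagonally dominant, not taken from the paper), integrates each equation with backward Euler on its own time grid, and uses linear interpolation to read the other waveform. All function names are hypothetical.

```python
from bisect import bisect_right

def interp(times, values, t):
    """Piecewise-linear interpolation of a sampled waveform at time t."""
    j = min(max(bisect_right(times, t) - 1, 0), len(times) - 2)
    w = (t - times[j]) / (times[j + 1] - times[j])
    return (1.0 - w) * values[j] + w * values[j + 1]

def backward_euler(times, x0, d, source):
    """Integrate x' = d * x + source(t) on the given grid with backward Euler."""
    x = [x0]
    for a, b in zip(times, times[1:]):
        h = b - a
        x.append((x[-1] + h * source(b)) / (1.0 - h * d))
    return x

def multirate_wr(grid1, grid2, x0, iters):
    """Gauss-Jacobi WR for the toy pair, each equation on its own grid.

    Returns both waveforms and the max change between successive iterates.
    """
    w1 = [x0[0]] * len(grid1)
    w2 = [x0[1]] * len(grid2)
    history = []
    for _ in range(iters):
        # Both solves read the *previous* iterate of the other waveform.
        new1 = backward_euler(grid1, x0[0], -2.0, lambda t: interp(grid2, w2, t))
        new2 = backward_euler(grid2, x0[1], -2.0, lambda t: interp(grid1, w1, t))
        history.append(max(max(abs(a - b) for a, b in zip(new1, w1)),
                           max(abs(a - b) for a, b in zip(new2, w2))))
        w1, w2 = new1, new2
    return w1, w2, history
```

For example, `multirate_wr([0.1 * i for i in range(21)], [0.25 * i for i in range(9)], (1.0, 0.0), 30)` runs the two equations with steps of 0.1 and 0.25 over $[0, 2]$. In this toy case the change between iterates at least halves on every pass, since each backward-Euler solve bounds its output by half the maximum of its source and linear interpolation cannot increase a maximum.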


5  One  Dimensional  Experiments 

Except for Theorem 1, the above theoretical results only apply under certain
conditions, and are only an indication that the WR algorithm may be effective.
In order to verify that the theoretical results apply in actual simulation, a
one-dimensional transient device simulation program was written and applied to a
one-dimensional approximation of an MOS device with a conducting channel.
The doping distribution for the one-dimensional device is given in Fig. 1, where
the tick marks denote the mesh points. Potential and electron concentration
boundary conditions were given at $x = 0.0$ and $x = 3.0\,\mu$m. The boundary values
for the electron concentration were computed assuming charge neutrality at the
“contacts”.

The  relaxation  process  was  tested  by  first  solving  the  static  problem  with 
zero  volts  across  the  “device”,  and  then  making  a  step  change  of  five  volts. 
Even with this simple example, the variable-by-variable WR algorithm as given
in Alg. 1 was ineffective. The iterates did not converge in a uniform manner,
and they converged very slowly.

In order to improve convergence, rather than using variable-by-variable
decomposition, we partitioned the problem into blocks based on two techniques.
First, we associated the electron concentration at node $i$, $n_i(t)$, with the potential
$\psi_i(t)$ at that node. Then, in order to try to satisfy the assumptions of Theorem
2, we placed together neighboring nodes where we expected rapid changes in
the electric field. The resulting blocks of nodes are boxed in Fig. 1.

The resulting waveform iterations for the slowest converging variable, the
electron concentration for the mesh point where the doping changes abruptly, are
plotted in Fig. 2. As the figure indicates, with the partitioning just described,
the WR process converges in just a few iterations and the contraction is uniform
through time as predicted by Theorem 2. The simulation was rerun with very
coarse timesteps to see the effects on convergence, and the WR iterations for the
same node are plotted in Fig. 3. As the figure indicates, using coarse timesteps
does not affect the overall convergence, although the convergence for small $t$ is
slowed.


6  Conclusions  and  Acknowledgements 

In this paper we presented some preliminary results that indicate the WR
algorithm may indeed be efficient for device transient simulation. In particular,
it was shown that under conditions that can be arranged for in practice, the
WR algorithm is a contraction in a uniform norm on any interval $[0, T]$. Also,
given these same conditions, the relaxation process will still converge even if
very different sets of timesteps are used for the individual iteration equations.
Finally, we verified the theoretical results on a one-dimensional example.

There are several aspects of WR that need to be addressed if this method
is to be efficient for two-dimensional MOS transient device simulation. Most
important, a general algorithm for blocking the device must be developed. An
efficient approach for determining what discretization points to use for the
algebraic constraints must be considered. In addition, the efficiency of WR methods
can also be improved by refining the timesteps with iterations, or using a single
waveform-Newton iteration to solve the nonlinear WR iteration equations.
A  Proof of Theorem 2

The WR iteration equations applied to the model problem (8) can be described
as

$$\dot{n}^{k+1}(t) = D(t)\, n^{k+1}(t) + M(t)\, n^{k}(t) \qquad (10)$$

where $D(t), M(t) \in \Re^{m \times m}$, and $D(t)$ is a negative diagonal matrix. The
assumptions about the electric field result in values for the Bernoulli functions such
that $D(t)$ and $M(t)$ will satisfy the relation

$$|d_{ii}(t)| \ge \epsilon_i + \sum_j |m_{ij}(t)| \qquad (11)$$

where $\epsilon_i \ge 0$ and is strictly greater than zero for those $i$'s corresponding to the
mesh points next to the boundaries. Note that this implies

$$\|D(t)^{-1} M(t)\| < \gamma \qquad (12)$$

for $\gamma < 1$, for some norm on $\Re^{m \times m}$ and for all $t$.
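
The role of the margins $\epsilon_i$ can be checked numerically in a simple case. The sketch below assumes a one-dimensional mesh with fixed (Dirichlet) values at both ends and reads $D$ and $M$ off equation (8), in units of $D_n / l^2$; the function `dominance_margins` is illustrative and not part of the paper.

```python
import math

def bernoulli(x):
    """B(x) = x / (exp(x) - 1), with B(0) = 1."""
    return 1.0 - 0.5 * x if abs(x) < 1e-8 else x / math.expm1(x)

def dominance_margins(u):
    """Return eps_i = |d_ii| - sum_j |m_ij| for the interior points of a 1-D mesh.

    u holds the normalized potentials at all mesh points, including the two
    boundary points u[0] and u[-1], whose concentrations are held fixed.
    """
    margins = []
    last = len(u) - 1
    for i in range(1, last):
        # |d_ii| collects B(u_i - u_j) over both neighbours, boundary or not.
        d_ii = bernoulli(u[i] - u[i - 1]) + bernoulli(u[i] - u[i + 1])
        off = 0.0
        for j in (i - 1, i + 1):
            if 0 < j < last:  # fixed boundary values do not enter M
                off += bernoulli(u[j] - u[i])
        margins.append(d_ii - off)
    return margins
```

For a potential that is linear in position (constant field), for instance `dominance_margins([0.5 * i for i in range(6)])`, the margins vanish at the two innermost points and are strictly positive at the two points adjacent to the boundaries, which is the pattern described after (11).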

Given the relationship between $D(t)$ and $M(t)$, the WR algorithm applied
to a system of the form of (10) will contract in a uniform norm. This has been
shown for the case when $D(t)$ and $M(t)$ are independent of $t$, using Laplace
transforms [2]. In the time dependent case, the result can be shown by examining
the difference between iteration $k$ and $k + 1$ of (10) to get

$$\delta\dot{n}_i^{k+1}(t) = d_{ii}(t)\, \delta n_i^{k+1}(t) + \sum_j m_{ij}(t)\, \delta n_j^{k}(t) \qquad (13)$$

for each mesh point $i$, where $\delta n^k(t) = n^k(t) - n^{k-1}(t)$. By assumption, $d_{ii}(t) < 0$
and $\delta n_i^k(0) = 0$. Therefore,

$$\max_{[0,T]} |\delta n_i^{k+1}(t)| \le \sum_j \max_{[0,T]} \frac{|m_{ij}(t)|}{|d_{ii}(t)|} \max_{[0,T]} |\delta n_j^{k}(t)|. \qquad (14)$$

Equation (14) follows from the fact that for all values of $\delta n_i^{k+1}(t)$ on the
boundary of (or outside) the bounded region, $\delta\dot{n}_i^{k+1}(t)$ points back into the bounded
region [6].

Assembling the equation system from (14) results in

$$\max_{[0,T]} |\delta n^{k+1}(t)| \le \max_{[0,T]} |D(t)^{-1} M(t)| \max_{[0,T]} |\delta n^{k}(t)|. \qquad (15)$$

Then in the norm for which $\|D(t)^{-1} M(t)\| < \gamma < 1.0$,

$$\max_{[0,T]} \|\delta n^{k+1}(t)\| < \gamma \max_{[0,T]} \|\delta n^{k}(t)\|. \qquad (16)$$

which proves the theorem.
Figure 2: The uniform WR convergence of the electron concentration at a node. (Axis label: picoseconds.)

Figure 3: The waveforms converge uniformly, even when the timesteps are coarse. (Axis label: picoseconds.)
Abstract

The power of Butterfly-type networks relative to other proposed multicomputer
interconnection networks is studied, by considering how efficiently the Butterfly
can simulate the other networks. Simulation is represented formally via graph
embeddings, so the topic here becomes: How efficiently can one embed the graph
underlying a given network in the graph underlying the Butterfly network? The
efficiency of an embedding of a graph G in a graph H is measured in terms of:
the dilation, or, the maximum amount that any edge of G is “stretched” by the
embedding; the expansion, or, the ratio of the number of vertices of H to the
number of vertices of G. Three general results about embeddings in Butterfly-type
graphs are established here, that expose a number of simulations by Butterfly-type
networks, which are optimal (to within constant factors): (1) Any complete binary
tree can be embedded in a Butterfly graph, with simultaneous dilation $O(1)$ and
expansion $O(1)$. (2) Any n-vertex graph having a $\sqrt{2}$-bifurcator of size $S = \Omega(\log n)$
can be embedded in a Butterfly graph with simultaneous dilation $O(\log S)$ and
expansion $O(1)$. (3) Any embedding of a planar graph G in a Butterfly graph
must have dilation $\Omega\left(\frac{\log \Sigma(G)}{\Phi(G)}\right)$, where: $\Sigma(G)$ is the size of the smallest 1/3-2/3
vertex-separator of G; $\Phi(G)$ is the size of G’s largest interior face. Corollaries include: (a)
The n-vertex X-tree can be embedded in the Butterfly with simultaneous dilation
$O(\log \log n)$ and expansion $O(1)$; no embedding yields smaller dilation, independent
of expansion. (b) Every embedding of the $n \times n$ mesh in the Butterfly has dilation
$\Omega(\log n)$; any expansion-$O(1)$ embedding of the mesh in the Butterfly achieves this
dilation. These results, which extend to Butterfly-like graphs such as the
Cube-Connected Cycles and Benes networks, supply the first examples of graphs that can
be embedded more efficiently in the Hypercube than in the Butterfly.


1.  INTRODUCTION 


This paper reports on a continuing program of the authors, dedicated to determining
the relative computational capabilities of the various interconnection networks that
have been proposed for use as multicomputer interconnection networks [BCLR, BI,
GHR, Le]. We focus here on one member of the family of butterfly-like machines,
that have become one of the benchmark architectures for multicomputers. The
major contributions of this paper are the following general results about embeddings
of graphs in Butterfly networks¹:

1. We embed the complete binary tree in the Butterfly network, with
simultaneous dilation $O(1)$ and expansion $O(1)$.

2. We embed any n-vertex graph having a $\sqrt{2}$-bifurcator of size $S = \Omega(\log n)$
in the Butterfly network, with simultaneous dilation $O(\log S)$ and expansion
$O(1)$.

3. We prove that any embedding of any planar graph G in a Butterfly network
must have dilation

$$\Omega\left(\frac{\log \Sigma(G)}{\Phi(G)}\right)$$

where: $\Sigma(G)$ is the size of the smallest 1/3-2/3 vertex-separator of G; $\Phi(G)$
is the size of G’s largest interior face.

The latter two results lead to embeddings of graphs such as X-trees and meshes in
the Butterfly, that are optimal, to within constant factors. By Result 2, such
embeddings can be found with expansion $O(1)$ and with, respectively, dilation $O(\log \log n)$
and $O(\log n)$; by Result 3, no embeddings can improve on these dilations,
independent of expansion. These embeddings expose X-trees and meshes as the first known
graphs that can be embedded very efficiently in the Hypercube (simultaneous dilation
$O(1)$ and expansion $O(1)$) but have no efficient embedding in butterfly-like graphs.
Note that, if we restrict attention only to the issue of dilation, then - to within
constant factors - these graphs cannot be embedded any more efficiently in Butterfly
graphs than they can in complete binary trees!


1.1.  The  Formal  Setting 

The technical vehicle for our investigations is the following notion of graph
embedding [Ro]. Let G and H be simple undirected graphs. An embedding of G in H is a
one-to-one association of the vertices of G with vertices of H, plus a routing of each
edge of G within H, i.e., an assignment of a path in H connecting the images of
the endpoints of each edge of G. The dilation of the embedding is the length of the
longest path in H that routes an edge of G; it thus measures how much the edges of
G are “stretched” by the embedding. The expansion of the embedding is the ratio
$|H| / |G|$ of the number of vertices in H to the number of vertices in G. We use the
dilation- and expansion-costs of the best embedding of G in H as our measures of
how well H can simulate G as an interconnection network: One views the graph H as
abstracting the processor-intercommunication structure of a physical architecture;
one views the graph G as abstracting either the task-interdependency structure of
an algorithm one wants to implement on H or the processor-intercommunication
structure of an architecture one wants to simulate on H.

¹ All technical terms are defined in Section 1.1.
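
The definitions of dilation and expansion translate directly into code. The following is a minimal sketch under assumed data structures (edge lists, a vertex map, and a dictionary of routing paths); none of these names come from the paper.

```python
def dilation(routes):
    """routes maps each edge (u, v) of G to a path [h0, h1, ..., hk] of H-vertices.

    The dilation is the length (number of edges) of the longest routing path.
    """
    return max(len(path) - 1 for path in routes.values())

def expansion(num_vertices_h, num_vertices_g):
    """Ratio |H| / |G| of host vertices to guest vertices."""
    return num_vertices_h / num_vertices_g

def check_embedding(g_edges, h_edges, vertex_map, routes):
    """Return True if vertex_map is one-to-one and every route is a valid H-path
    joining the images of the endpoints of its G-edge."""
    if len(set(vertex_map.values())) != len(vertex_map):
        return False
    h_adj = {frozenset(e) for e in h_edges}
    for (u, v) in g_edges:
        path = routes[(u, v)]
        if path[0] != vertex_map[u] or path[-1] != vertex_map[v]:
            return False
        if any(frozenset((a, b)) not in h_adj for a, b in zip(path, path[1:])):
            return False
    return True
```

As a usage example, embedding a triangle on vertices a, b, c into the 4-cycle 0-1-2-3-0 with a, b, c mapped to 0, 1, 2 forces the edge (a, c) onto a path of length 2 such as 0-3-2, so the dilation is 2 and the expansion is 4/3.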

Remark. A third important measure of how well H can simulate G is congestion,
the maximum number of edges that are routed through a single edge (or vertex) of
H. Congestion does not play a major role in this paper, however, since

1. our embedding of a complete binary tree in a Butterfly trivially has unit
congestion;

2. the n-vertex Butterfly is known to be able to simulate any n-vertex
bounded-degree graph with $O(\log n)$ delay, irrespective of the fact that the dilation and
congestion of the corresponding embedding may both be $\Omega(\log n)$;

3. our major focus is on developing broadly applicable techniques for bounding
the dilation of embeddings.

Hence, for our purposes, dilation is the central measure of concern.

Our results hold for a large variety of “levelled” Hypercube-derivative host
graphs (which play the role of our H’s), that we collectively term butterfly
networks. For the sake of rigor, we focus on one particular such network (which can
be viewed as the FFT network, with input and output vertices identified), although
we could just as easily substitute other such graphs - the Cube-Connected Cycles
[PV] or Benes network [Be], for example. Formally,

•  Let $m$ be a positive integer. The m-level Butterfly graph $B(m)$ has vertex-set²

$$V_m = \{0, 1, \ldots, m-1\} \times \{0,1\}^m.$$

The subset $V_{m,\ell} = \{\ell\} \times \{0,1\}^m$ of $V_m$ ($0 \le \ell < m$) is the $\ell$th level of $B(m)$.
The string $x \in \{0,1\}^m$ of vertex $\langle \ell, x \rangle$ is the position-within-level string (PWL
string, for short) of the vertex. The edges of $B(m)$ form butterflies (or, copies
of $K_{2,2}$) between consecutive levels of vertices, with wraparound in the sense
that level 0 is identified with level $m$. Each butterfly connects vertices

$$\langle \ell,\ \beta_0 \beta_1 \cdots \beta_{\ell-1}\, 0\, \beta_{\ell+1} \cdots \beta_{m-1} \rangle$$

and

$$\langle \ell,\ \beta_0 \beta_1 \cdots \beta_{\ell-1}\, 1\, \beta_{\ell+1} \cdots \beta_{m-1} \rangle$$

on level $\ell$ of $B(m)$ ($0 \le \ell < m$; each $\beta_i \in \{0,1\}$) with vertices

$$\langle \ell + 1 \ (\mathrm{mod}\ m),\ \beta_0 \beta_1 \cdots \beta_{\ell-1}\, 0\, \beta_{\ell+1} \cdots \beta_{m-1} \rangle$$

and

$$\langle \ell + 1 \ (\mathrm{mod}\ m),\ \beta_0 \beta_1 \cdots \beta_{\ell-1}\, 1\, \beta_{\ell+1} \cdots \beta_{m-1} \rangle$$

on level $\ell + 1 \ (\mathrm{mod}\ m)$ of $B(m)$. One can represent $B(m)$ level by level, in
such a way that at each level the PWL strings are the reversals of the binary
representations of the integers $0, 1, \ldots, 2^m - 1$, in that order. See Fig. 1.

² $\{0,1\}^m$ denotes the set of length-m binary strings.

Figure 1: The 3-level Butterfly graph $B(3)$
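
The definition of $B(m)$ can be turned into a short constructor. This is a sketch under an assumed representation (a vertex is a pair of a level and a PWL bit string, an edge is a two-element frozenset); it is intended for $m \ge 3$, where the wraparound does not create coincident edges.

```python
from itertools import product

def butterfly(m):
    """Return (vertices, edges) of B(m); a vertex is (level, pwl) with pwl a bit string."""
    strings = ["".join(bits) for bits in product("01", repeat=m)]
    vertices = [(level, x) for level in range(m) for x in strings]
    edges = set()
    for level, x in vertices:
        nxt = (level + 1) % m  # wraparound: the level after m - 1 is level 0
        flipped = x[:level] + ("1" if x[level] == "0" else "0") + x[level + 1:]
        edges.add(frozenset(((level, x), (nxt, x))))        # bit `level` unchanged
        edges.add(frozenset(((level, x), (nxt, flipped))))  # bit `level` flipped
    return vertices, edges
```

For `butterfly(3)` this yields $3 \cdot 2^3 = 24$ vertices, each of degree 4, and 48 edges.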

The guest graphs in our study, which play the role of our G’s, are complete
binary trees, X-trees, and meshes; see Fig. 2. Formally,

•  The height-h complete binary tree $T(h)$ is the graph whose $(2^{h+1} - 1)$-element
vertex-set comprises all binary strings of length at most $h$, and whose edges
connect each vertex $z$ of length less than $h$ with vertices $z0$ and $z1$. The
(unique) string of length 0 is the root of the tree, which is the sole occupant of
level 0 of the tree; the $2^\ell$ strings of length $\ell$ are the level-$\ell$ vertices of the tree;
the strings of length $h$ (i.e., the level-$h$ vertices) are the leaves of the tree.

•  The height-h X-tree $X(h)$ is the graph that is obtained from the height-$h$
complete binary tree $T(h)$ by adding cross edges connecting the vertices at
each level of $T(h)$ in a path, with the vertices in lexicographic order. X-trees
inherit a level structure from their underlying complete binary trees.

•  The $s \times s$ mesh $M(s)$ is the graph whose $s^2$-element vertex set comprises the
ordered pairs of integers

$$\{1, 2, \ldots, s\} \times \{1, 2, \ldots, s\},$$

and whose edges connect vertices $(a, b)$ and $(c, d)$ just when $|a - c| + |b - d| = 1$.
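
The three guest families can be generated directly from their definitions. The sketch below assumes edges are stored as ordered pairs and uses hypothetical function names; it is meant only to make the definitions concrete for small sizes.

```python
from itertools import product

def complete_binary_tree(h):
    """T(h): vertices are binary strings of length <= h; z is joined to z0 and z1."""
    vertices = ["".join(b) for l in range(h + 1) for b in product("01", repeat=l)]
    edges = {(z, z + c) for z in vertices if len(z) < h for c in "01"}
    return vertices, edges

def x_tree(h):
    """X(h): T(h) plus a path through each level in lexicographic order."""
    vertices, edges = complete_binary_tree(h)
    edges = set(edges)
    for l in range(1, h + 1):
        level = sorted(z for z in vertices if len(z) == l)
        edges.update(zip(level, level[1:]))  # cross edges along the level
    return vertices, edges

def mesh(s):
    """M(s): pairs in {1..s} x {1..s}, adjacent when |a - c| + |b - d| = 1."""
    vertices = [(a, b) for a in range(1, s + 1) for b in range(1, s + 1)]
    edges = {(p, q) for p in vertices for q in vertices
             if p < q and abs(p[0] - q[0]) + abs(p[1] - q[1]) == 1}
    return vertices, edges
```

For instance, `complete_binary_tree(3)` has 15 vertices and 14 edges, `x_tree(3)` adds 1 + 3 + 7 = 11 cross edges for 25 in total, and `mesh(4)` has 16 vertices and 24 edges.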

All of these networks have been seriously proposed as interconnection networks
for multicomputers [DP, Ga, HZ], hence are important candidates for our study.
Another approach to comparing these networks, via implementation and analysis
of specific algorithms, appears in [Ag].

Our results depend on three structural features of a graph G:

1. Let $S$ and $k$ be positive integers. The n-vertex graph G has a k-color
$\sqrt{2}$-bifurcator of size $S$ if either $n < 2$ or the following holds for every way of
labelling each vertex of G with one of $k$ possible labels: By removing $\le S$
vertices from G, one can partition G into subgraphs $G_1$ and $G_2$ such that³

(a) $\bigl|\, |G_1| - |G_2| \,\bigr| \le 1$.

(b) For each label $l$, the number of $l$-labelled vertices in $G_1$ is within 1 of the
number of $l$-labelled vertices in $G_2$.

(c) Each of $G_1$ and $G_2$ has a k-color $\sqrt{2}$-bifurcator of size $S / \sqrt{2}$.

2. A 1/3-2/3 (vertex-)separator of G is a set of vertices whose removal partitions
G into subgraphs, each having $\ge |G|/3$ vertices; we denote by $\Sigma(G)$ the size
of the smallest 1/3-2/3 vertex-separator of G.

3. When G is planar and we are given a witnessing planar embedding $\epsilon$, we
denote by $\Phi_\epsilon(G)$ the number of vertices in G’s largest interior face in the
embedding. When $\epsilon$ is clear from context, we omit the subscript.

³ We denote by $|G|$ the number of vertices in the graph G.
1.2.  The Main Results

We  prove  three  results  about  optimal  embeddings  in  the  Butterfly  that  lead  to  a 
variety  of  nontrivial  optimal  embeddings. 

Theorem 1 The complete binary tree $T(h)$ can be embedded in a Butterfly graph,
with simultaneous dilation $O(1)$ and expansion $O(1)$.

Obviously,  the  embedding  of  Theorem  1  is  within  a  constant  factor  of  optimal  in 
both  dilation  and  expansion.  Building  on  the  embedding,  we  obtain  the  following 
general  upper  bound  result. 

Theorem 2 Any n-vertex graph G having a $\sqrt{2}$-bifurcator of size $S = \Omega(\log n)$ can
be embedded in a Butterfly graph with simultaneous dilation $O(\log S)$ and expansion
$O(1)$.

We balance Theorem 2 with one of the first broadly applicable results for
bounding dilation from below.

Theorem 3 Any embedding of a nontree planar graph G in a Butterfly graph has
dilation $\Omega\left(\frac{\log \Sigma(G)}{\Phi(G)}\right)$. This bound cannot be improved in general.

Direct  application  of  the  proofs  of  these  results  yields  the  following  optimal 
embeddings. 

Corollary 1 The height-h X-tree $X(h)$ can be embedded in a Butterfly graph with
simultaneous dilation $O(\log h) = O(\log \log |X(h)|)$ and expansion $O(1)$. Any
embedding of $X(h)$ in a Butterfly graph must have dilation $\Omega(\log h) = \Omega(\log \log |X(h)|)$.

Corollary 2 Any embedding of the $s \times s$ mesh $M(s)$ in a Butterfly graph must have
dilation $\Omega(\log s) = \Omega(\log |M(s)|)$.

Corollary 2 betokens a mismatch in the structures of meshes and Butterfly
graphs, since any expansion-$O(1)$ embedding of any graph G in $B(m)$ has dilation
$O(\log |G|)$.⁴

⁴ This follows from the facts that $B(m)$ has $m 2^m$ vertices and diameter $O(m)$.
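
The footnote's argument can be spelled out in two lines. The sketch below assumes the expansion is bounded by a constant $c$ (a symbol introduced here for illustration); the last line records why the two forms of the bound in Corollary 2 agree.

```latex
|B(m)| = m\,2^m \le c\,|G|
  \;\Longrightarrow\; m \le \log_2\bigl(c\,|G|\bigr) = O(\log |G|),
\qquad
\text{dilation} \le \operatorname{diam} B(m) = O(m) = O(\log |G|).

|M(s)| = s^2
  \;\Longrightarrow\; \log s = \tfrac{1}{2} \log |M(s)|,
\quad\text{so}\quad \Omega(\log s) = \Omega(\log |M(s)|).
```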
Theorem 1 and Corollaries 1 and 2 can be interpreted as yielding tight bounds
on the efficiency with which a Butterfly machine can simulate a
complete-binary-tree machine, an X-tree machine, and a mesh-structured machine, with regard to
both delay (dilation) and resource utilization (expansion). Equating dilation with
delay is most appropriate when the machines are to be run in SIMD mode.

The  next  three  sections  are  devoted  to  proving  our  main  results. 

2.  COMPLETE  BINARY  TREES 

2.1.  Embedding  Many  Small  Trees  in  a  Butterfly 

It is obvious from inspection that one can find an instance of the height-$(m - 1)$
complete binary tree $T(m - 1)$ rooted at every vertex of $B(m)$. Somewhat less
obvious is the fact that one can find $m$ mutually disjoint instances of $T(m - 1)$ as
subgraphs of $B(m)$. We now verify this fact via an embedding which will prove
useful as we develop our final embedding.

Proposition 1 For every integer $m$, one can find $m$ mutually disjoint instances of
$T(m - 1)$ as subgraphs of $B(m)$.

Proof. To simplify exposition, we represent sets of binary strings by strings over
the alphabet {0, 1, *}, using * as a wild-card character. The length-k string

$\beta = \beta_0 \beta_1 \cdots \beta_{k-1}$,

where each $\beta_i \in \{0, 1, *\}$, represents the set $\sigma(\beta)$ of all length-k binary strings that
have a 0 in each position i of $\beta$ where $\beta_i = 0$, a 1 in each position i of $\beta$ where
$\beta_i = 1$, and either a 0 or a 1 in each position i of $\beta$ where $\beta_i = *$. For illustration,
$\sigma(010) = \{010\}$, and $\sigma(0{*}1) = \{001, 011\}$. Call the string $\beta$ the code for the set
$\sigma(\beta)$.
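The wild-card codes can be made concrete with a short sketch. The function name `sigma` simply mirrors the notation above and is not taken from the source; the sketch expands a code over {0, 1, *} into the set of binary strings it represents.

```python
from itertools import product


def sigma(code: str) -> set[str]:
    """Return the set of binary strings represented by a code over {0, 1, *}."""
    if any(ch not in "01*" for ch in code):
        raise ValueError("codes may only contain 0, 1 and *")
    # A wild card contributes both bits; a fixed symbol contributes itself.
    choices = ["01" if ch == "*" else ch for ch in code]
    return {"".join(bits) for bits in product(*choices)}


print(sorted(sigma("010")))  # ['010']
print(sorted(sigma("0*1")))  # ['001', '011']
```

A code with w wild cards represents exactly 2^w strings, which is why a level-l tree vertex set is described below by a code with l wild cards.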

On to our embeddings of m instances of T(m − 1) in B(m): For any letter a
and nonnegative integer k, we denote by $a^k$ a string of k a's.


For the first instance of T(m − 1), we have the following correspondence between
tree vertices and Butterfly vertices.

    T(m − 1)        B(m)
    level 0         (0, 0^m)
    level 1         (1, *0^{m−1})
    level 2         (2, *^2 0^{m−2})
    ⋮
    level m − 1     (m − 1, *^{m−1} 0)
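The first instance can be checked mechanically for small m. The sketch below assumes a common wrapped-Butterfly convention that this section does not spell out: vertex (l, w) is adjacent to (l + 1 mod m, w') whenever w and w' agree outside bit position l. The helper names are hypothetical.

```python
from itertools import product


def first_instance(m: int) -> list[list[tuple[int, str]]]:
    """Level l of T(m-1) maps to the vertices (l, w) with w in sigma(*^l 0^(m-l))."""
    return [
        [(l, "".join(bits) + "0" * (m - l)) for bits in product("01", repeat=l)]
        for l in range(m)
    ]


def parent(v: tuple[int, str]) -> tuple[int, str]:
    """Tree parent of a non-root image: one level up, with bit l-1 reset to 0."""
    l, w = v
    return (l - 1, w[: l - 1] + "0" + w[l:])


def is_butterfly_edge(u: tuple[int, str], v: tuple[int, str], m: int) -> bool:
    """Assumed convention: (l, w) ~ (l+1 mod m, w') iff w and w' agree outside bit l."""
    (lu, wu), (lv, wv) = u, v
    return lv == (lu + 1) % m and all(
        a == b for i, (a, b) in enumerate(zip(wu, wv)) if i != lu
    )


m = 4
levels = first_instance(m)
images = [v for level in levels for v in level]
# The map is one-to-one and uses 2^m - 1 Butterfly vertices.
assert len(set(images)) == 2**m - 1
# Every tree edge lands on a Butterfly edge, so this instance has unit dilation.
assert all(is_butterfly_edge(parent(v), v, m) for level in levels[1:] for v in level)
```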

For each subsequent instance of T(m − 1), say the jth where 1 < j < m, we have
the following correspondence between tree vertices and Butterfly vertices.

    T(m − 1)        B(m)
    level 0         (j − 1,
    level 1         (j, 0^{j−1} 1 * 0^{m−j−2} 1)
    level 2         (j + 1 (mod m), 0^{j−1} 1 *^2 0^{m−j−3} 1)
    level m − 1     (j − 2,

The placement of the 1's in the PWL strings ensures that the m instances of
T(m − 1) are mutually disjoint. To verify this, via contradiction, let us look at an
arbitrary level $\ell$ of B(m) and at arbitrary distinct tree vertices i and j that collide
at some position within level $\ell$ of B(m). It is clear that all Butterfly vertices that
are images of the same instance of T(m − 1) are distinct, so we may assume that
vertices i and j come from distinct instances of T(m − 1), call them $\iota(i)$ and $\iota(j)$,
where the $\iota$-“name” of an instance of T(m − 1) is the level of B(m) where its root
resides. We consider four cases that exhaust the possibilities. In each case, we
adduce a property of the PWL strings that precludes any overlap in the images of
the trees.

$\iota(i) = 0$:

If $\iota(i) = 0$, then the PWL string of i ends with $0^{m-\ell}$, while the PWL
string of j has a 1 in this range, specifically, in position m − 1 if $j < \ell$,
and in position j if $j > \ell$.

$1 \le \iota(i) < \iota(j) \le \ell$:

Every PWL string of $\iota(i)$ starts with $0^{i} 1$, while every PWL string of $\iota(j)$
starts with $0^{j} 1$.

$1 \le \iota(i) \le \ell < \iota(j)$:

Every PWL string of $\iota(i)$ has a 0 in position j, while every PWL string
of $\iota(j)$ has a 1 in that position.

$\ell < \iota(i) < \iota(j) < m$:

Every PWL string of $\iota(i)$ has a 1 in position i, while every PWL string
of $\iota(j)$ has a 0 in position i.

The  proof  is  complete.  □ 

An  algebraic  proof  of  Proposition  1,  which  is  “cleaner”  than  our  combinatorial 
proof here, appears in [ABR]; however, it is the embedding rather than the result
that  will  be  helpful  in  our  proof  of  Theorem  1. 

The  embedding  in  our  proof  of  Proposition  1  does  not  serve  us  directly  in  our 
attempt  to  embed  a  large  complete  binary  tree  in  a  small  Butterfly,  since  (for  one 
thing)  it  places  the  roots  of  every  instance  of  T(m  -  1)  at  a  different  level  of  B(m); 
and  it  is  not  clear  how  to  combine  these  instances  into  a  bigger  complete  binary 
tree  with  small  dilation.  However,  the  overall  strategy  of  the  embedding  will  be 
useful in Section 2.2.D.


2.2.  Optimally  Embedding  Trees  in  Butterfly  Graphs 

We turn now to the proof of Theorem 1. Specifically, we prove the following.

For any integer m, one can embed the complete binary tree $T(m + \lfloor \log m \rfloor - 1)$
in the Butterfly graph B(m + 3), with dilation $O(1)$.

To simplify our description, let $q =_{\mathrm{def}} m + \lfloor \log m \rfloor - 1$, and assume henceforth
that m is even; clerical changes will remove the assumption.

A.  The  Embedding  Strategy 

We wish to embed the tree T(q) with dilation O(1), in the smallest Butterfly
that is big enough to hold the tree, namely, B(m). We fall somewhat short of this
goal, but not by much: We find an embedding with dilation O(1), but we have
to use a somewhat larger host Butterfly graph (specifically, B(m + 3)) in order to
resolve collisions in our embedding procedure. Our embedding proceeds in four
stages. Stage 1 embeds the top log m levels of T(q) with unit dilation in B(m),
thereby specifying implicitly the images in B(m) of the roots of the m/2 subtrees of
T(q) rooted at level log m − 1. Stage 2 expands these subtrees a further m/2 levels,
but now in B(m + 1), with dilation 2, thereby specifying implicitly the images in
B(m) of the roots of the $m \cdot 2^{m/2-1}$ subtrees of T(q) rooted at level m/2 + log m − 1
of the tree. In Stage 3, we embed the final m/2 levels of T(q) in B(m + 1), with
dilation 4. The vertex-mappings in each stage are embeddings (i.e., are one-to-one);
there is, however, “overlap” (i.e., distinct vertices of T(q) getting mapped to the
same vertex of B(m + 1)) among the mappings of the three stages. In Stage 4, we
eliminate this overlap by expanding the host Butterfly by two more levels, thereby
giving us four connected isomorphic copies of B(m + 1). At the cost of increasing
dilation by 2, we modify our mapping so that each of Stages 1, 2, 3 is performed in
a distinct copy of B(m + 1), thereby eliminating all overlap.
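To see that moving from B(m) to B(m + 3) costs only a constant factor, one can compare vertex counts directly. The sketch below assumes base-2 logarithms, takes T(h) to have $2^{h+1} - 1$ vertices and B(m) to have $m 2^m$ vertices as stated earlier, and uses hypothetical function names.

```python
def tree_size(height: int) -> int:
    """Vertices in the complete binary tree T(height): 2^(height+1) - 1."""
    return 2 ** (height + 1) - 1


def butterfly_size(m: int) -> int:
    """Vertices in B(m): m levels of 2^m vertices each."""
    return m * 2**m


def expansion(m: int) -> float:
    """|B(m+3)| / |T(q)| with q = m + floor(log2 m) - 1."""
    q = m + (m.bit_length() - 1) - 1  # bit_length() - 1 is floor(log2 m) for m >= 1
    return butterfly_size(m + 3) / tree_size(q)


for m in (2, 8, 64, 1024):
    print(m, round(expansion(m), 2))
```

For m a power of two the ratio approaches 8 as m grows; for other m it stays below a small constant as well, which is the expansion-O(1) claim in numerical form.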

B.  Stage  1:  The  Top  logm  Levels  of  T(q) 

We place the root of T(m + log m) at position

$(m - \log m,\ 0^m)$

of B(m). We then proceed to higher-numbered levels, embedding the top log m
levels  of  T(q)  as  a  subgraph  of  B(m),  ending  up  with  the  leaves  of  these  levels  in 
positions 

of  B(m)  (because  of  wraparound).  See  Fig.  3.  We  call  the  rightmost  logm  -  1  bits 
of  each  of  the  resulting  PWL  strings  the  signature  of  the  Butterfly  position  and  of 
the  subtree  rooted  at  that  position.  It  is  convenient  to  interpret  a  signature  as  an 
integer in the range {0, 1, ..., m/2 − 1}, as well as a bit string.

The  embedding  in  Stage  1  is  trivially  one-to-one,  with  unit  dilation. 

C.  Stage  2:  The  Next  m/2  Levels  of  T(q) 

Call the (m/2 + 1)-level subtree of T(q) that has signature k, the kth subtree.
Our goal is to embed the kth subtree in B(m + 1) (with dilation 2), so that its $2^{m/2}$
leaves form the set of positions⁵

(m − 1, *0*0 ⋯ *0*1*0 ⋯ *0*0?),

⁵The last bit position is not affected by this Stage, so is denoted “?”.

where the 1 appears in the kth even position from the right (using 0-based counting);
call this the signatory 1 of the tree position. For instance, when m = 8, the second
subtree has leaves in positions

(7, *0*1*0*0?)

of B(9). We embed these (m/2 + 1)-level trees by alternating binary and unary
branchings in B(m + 1), starting at the “roots” placed at level-0 vertices of B(m + 1)
during Stage 1; we place a tree-vertex after each unary branching. See Fig. 4.
Binary branchings generate the *'s in the code for the set of PWL strings, while
unary branchings generate the 0's and 1's in the code. As a simple example: a
binary branching from vertex

(0,  000000011), 

which  holds  the  root  of  one  of  the  subtrees  planted  during  Stage  1,  generates  vertices 

(1,  *00000011); 

a  unary  branching  thence  generates  vertices 

(2,  *00000011), 

where we place the level-1 vertices of the subtree; a second binary branching generates
vertices

(3, *0*000011);

a unary branching thence generates vertices

(4, *0*100011),

where we place the level-2 vertices of the subtree; a subsequent sequence of alternating
binary and unary branchings finally embeds the desired set of leaf positions
in the advertised vertices of B(m + 1).

This stage of our embedding clearly has dilation 2. The fact that this stage
is one-to-one (though it may produce conflicts with the embedding from Stage 1)
has two origins. First, we are using levels 0 through m of B(m + 1) for the m + 1
levels of this stage, so the leaves of the embedded trees do not wrap around to
conflict with their roots. Second, each signatory 1, whose placement identifies its
respective tree, is set “on” before the signature bits are reached and altered by the
sequence of branchings. This is ensured by the fact that we place the signatory 1
by counting from the right: the signature bits occupy the rightmost log m − 1 bits
of the PWL string; by the time the branchings have reached the ith bit from the
right, only the rightmost (log i) bits of the signature are needed to specify the next
position where branching occurs. Hence, at the point when we place the signatory
1 in the ith position, the odd-numbered positions to the left of the 1 are all 0, and
the positions to the right of the 1 form the binary representation of i, possibly with
leading 0's.

Figure 4: A logical view of the next m/2 levels of the embedding

D.  Stage  3:  The  Final  m/2  Levels  of  T(q) 

Our goal in Stage 3 is to use the $m \cdot 2^{m/2-1}$ leaves of the m/2 trees generated
in Stage 2 as the roots of the (m/2 + 1)-level subtrees comprising the bottom m/2
levels of T(q). Each root has a signatory 1, identifying the subtree it came from
in Stage 2, and a serial number obtained from the odd-numbered bits of its PWL
string. The signatory 1's will keep trees sired by different Stage-2 trees disjoint; the
serial numbers will guard against collisions among trees that were sired by the same
Stage-2 tree. The main challenge here is to achieve the embedding while the roots
of all the trees reside at the same level of B(m + 1) (which is how Stage 2 has placed
them). To accomplish this, we have the trees grow upward, in the direction of lower
level-numbers, for varying amounts of time, before starting to grow downward, in
the direction of higher level-numbers. While growing either upward or downward,
a tree grows via alternating unary and binary branchings, so as to preserve the
serial number. This alternation will incur dilation 2. An additional dilation of 2 is
incurred while a tree grows upward: each tree begins to grow upward using only
every fourth level of B(m + 1); when it “turns” from growing upward to growing
downward, it uses the levels it has skipped while moving upward to regain level 0
of B(m + 1), at which time it grows downward using every other level of B(m + 1).
See Fig. 5. Thus, in all, this Stage of the embedding incurs dilation 4.

Figure 5: A logical view of the final m/2 levels

All trees with the same signatory 1 (i.e., rooted at the leaves of the same Stage-2
tree) will grow in lockstep. We refer to the trees sharing a signatory 1 in the kth
even bit-position as the kth subtrees of T(q), $0 \le k < m/2$. We place the vertices of
the kth subtrees of T(q) into B(m + 1) as follows:

• For the 0th trees, we place the $2^{\ell}$ level-$\ell$ vertices of T(q) at level $2\ell$ of B(m + 1).
(Thus, these trees grow downward immediately.)

• For the kth trees, k > 0:

- we place their unique level-0 vertex at level 0 of B(m + 1) (in fact this
was placed during Stage 2)

- for $1 \le \ell \le \lfloor k/2 \rfloor$, we place their $2^{\ell}$ level-$\ell$ vertices at level $m - 4\ell + 1$ of
B(m + 1)

- if k is odd, we place their level-$\lceil k/2 \rceil$ vertices at level $m - 4\lceil k/2 \rceil + 3$
of B(m + 1)

- for $\lceil k/2 \rceil + 1 \le \ell \le k$, we place their $2^{\ell}$ level-$\ell$ vertices at level
$m - 4(k - \ell) - 1$ of B(m + 1)

Now  we  verify  that  the  described  mapping  is  one-to-one,  hence  an  embedding. 
We  consider  separately  the  two  potential  sources  of  collisions. 

First, we note that there can be no collisions among the $2^{m/2}$ kth trees, for any
k, since each of these trees has a unique serial number.

Second, we note that, for each fixed serial number, there can be no collision
between the jth and kth trees having that serial number. This is argued most easily
by considering how such trees are laid out level by level. To simplify exposition, we
present only the even bit-positions of the image vertices in B(m + 1), since the odd
bit-positions hold identical serial numbers. Note first that the top k levels of each
kth tree are placed in vertices of the form

$(\ell,\ 0^{m/2-k}\, 1\, {*}^{k})$

in B(m + 1); hence, their membership in a kth tree is announced by the leftmost
m/2 − k + 1 even bit-positions of the PWL strings. For tree-levels > k, the jth and
kth trees are distinguished as follows. Say, with no loss of generality, that j < k.
For each $0 \le \ell \le m/2 - k$, the level-$(k + \ell)$ vertices of each kth tree are placed at
vertices

$(\ell,\ {*}^{\ell}\, 0^{m/2-k-\ell}\, 1\, 0^{k})$

of B(m + 1). By the same token, for each $0 \le \ell \le m/2 - j$, the level-$(j + \ell)$ vertices
of each jth tree are placed at vertices

$(\ell,\ {*}^{\ell}\, 0^{m/2-j-\ell}\, 1\, 0^{j})$

of B(m + 1). Since j < k by hypothesis, we see that, at those levels of B(m + 1)
where we place vertices of both trees, the kth even bit-position from the right of each
kth tree contains a 1, while the corresponding bit-position of each jth tree contains
a 0.

Thus,  the  mapping  in  this  stage  is  an  embedding. 

E. Resolving Collisions

We now have three subembeddings that accomplish the desired task, except for
the fact that Stage i and Stage j may map different tree vertices to the same Butterfly
vertex. We resolve these possible collisions as follows. Instead of performing
the subembeddings in B(m + 1), we perform them in B(m + 3), placing each subembedding
in a distinct copy of B(m + 1). We make the transition between copies
of B(m + 1) as follows. As the Stage-1 embedding of the top of T(q) reaches level
m − 1 of its copy of B(m + 1), we use a sequence of unary branchings in B(m + 3) to
reach level 0 of the next copy of B(m + 1). We perform the Stage-2 subembedding
within this second copy; this takes us to level m − 1 of that copy, where a sequence
of unary branchings in B(m + 3) takes us to level 0 of the third copy of B(m + 1).
We perform the Stage-3 subembedding in this third copy. See Fig. 6. The transition
from level m − 1 of the second copy of B(m + 1) to level 0 of the third copy
engenders dilation 4.

Figure 6: Replicating B(m + 1) to avoid collisions

The  embedding,  hence  the  proof,  is  now  complete.  □ 

2.3.  The  Issue  of  Optimality 

Theorem 1 settles for an embedding of complete binary trees in Butterfly graphs
that achieves dilation O(1) and expansion O(1) simultaneously. While this achieves
our overall goal of optimality to within constant factors, it does leave open the
possibility of constant-factor improvements. We have been unable to determine
exact dilation-expansion tradeoffs for embeddings of complete binary trees in
Butterfly graphs, but we can show easily that it is impossible to optimize both cost
measures simultaneously. Thus, one cannot hope for the level of “perfection” found
in, say, [GHR]⁶.


Proposition 2 No embedding of T(q) in B(m + 1) has unit dilation.

Proof.  Both  complete  binary  trees  and  Butterfly  graphs  are  bipartite  graphs:  one 
can  color  the  vertices  of  either  graph  red  and  blue  in  such  a  way  that  every  edge 
connects  a  red  vertex  and  a  blue  one.  For  any  Butterfly  graph  B(r),  on  the  one 
hand,  the  numbers  of  red  and  blue  vertices  are  within  r  of  being  equal;  for  any 
complete  binary  tree,  on  the  other  hand,  one  of  the  sets  has  roughly  twice  as  many 
vertices  as  the  other.  Thus,  one  cannot  find  a  unit-dilation  embedding  of  a  complete 
binary tree in the smallest Butterfly graph that has enough vertices to hold it. □
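The imbalance used in this proof is easy to tabulate. The sketch below colors vertices by level parity. For the Butterfly it is restricted to an even number of levels, which is an assumption made here so that the wraparound edges also join opposite colors; the function names are hypothetical.

```python
def tree_color_classes(height: int) -> tuple[int, int]:
    """Sizes of the two color classes of T(height) under level-parity coloring."""
    even = sum(2**l for l in range(0, height + 1, 2))
    odd = sum(2**l for l in range(1, height + 1, 2))
    return even, odd


def butterfly_color_classes(r: int) -> tuple[int, int]:
    """Color classes of B(r) by level parity; only valid here for even r."""
    if r % 2:
        raise ValueError("level-parity coloring needs an even number of levels")
    half = (r // 2) * 2**r  # r/2 levels of each color, 2^r vertices per level
    return half, half


print(tree_color_classes(4))       # (21, 10): one class is about twice the other
print(butterfly_color_classes(4))  # (32, 32): the classes are balanced
```

A unit-dilation embedding must send each color class of the tree into a single color class of the host, so the larger tree class needs a Butterfly class of at least that size.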


3.  UPPER  BOUNDS  -  THEOREM  2 

This section is devoted to proving Theorem 2. Since all of the relevant ideas in the
proof are present in its application to specific families of graphs, we actually prove
only the upper bound of Corollary 1. The reader should be able to generalize easily
to arbitrary families of graphs, thereby proving Theorem 2. For the remainder of
the Section, we therefore focus on the problem of embedding X-trees in Butterflies.

Our embedding of the X-tree in the Butterfly graph is indirect: First we find a
unit-expansion, dilation-$O(\log\log n)$ embedding of X(h) in T(h). Then we compose
this embedding with the expansion-O(1), dilation-O(1) embedding of T(h) in B(m)
from Theorem 1, to obtain the upper bound of Theorem 2. We discuss here only the
former embedding, which, in fact, embeds the X-tree X(m) in the complete binary
tree T(m). For notational simplicity, let $n =_{\mathrm{def}} 2^{m+1} - 1$, the number of vertices in
X(m). We devote this section to proving the following.

Proposition 3 For any integer m, one can embed the X-tree X(m) in the complete
binary tree T(m), with dilation $O(\log m) = O(\log\log n)$.

Using the obvious fact that the n-vertex X-tree can be bisected (in the sense of
statement 1 above) by removing $O(\log n)$ edges, coupled with techniques in Section
I of [BL], the reader can easily prove the following.

Lemma 1 For all positive integers n, k, the n-vertex X-tree has a k-color
$\sqrt{2}$-bifurcator of size $S = 2k \cdot \log n$.

⁶In [GHR] a variant of B(m) with no wraparound is embedded in the Hypercube with unit
dilation and optimal expansion.


Proof of Proposition 3. Our embedding uses the following auxiliary structure, which
appears (in slightly different form) in [BCLR]. A bucket tree is a complete binary
tree, each of whose level-$\ell$ vertices has (bucket) capacity

$c \cdot \log\left(\frac{n}{2^{\ell}}\right)$

for some fixed constant c to be chosen later (in Lemma 2). We embed X(m)
in T(m) in two stages: First, we embed X(m) in a bucket tree, via a many-to-one
function p that “respects” bucket capacities (always placing precisely
$c \cdot \log((2^{m+1} - 1)/2^{\ell})$ vertices of X(m) in each level-$\ell$ vertex of the bucket tree) and has constant
“dilation”. Then we “spread” the contents of the bucket tree's buckets within T(m),
to achieve an embedding of X(m) in T(m), with the claimed dilation. Formally,
the first stage of the embedding is described as follows.


Lemma 2 Every X-tree X(m) can be mapped onto a bucket tree in such a way that:

(a) exactly

$N(\ell) = 14 \log\left(\frac{2^{m+1} - 1}{2^{\ell}}\right) + 24$

vertices of X(m) are mapped to each level-$\ell$ vertex of the bucket tree, and

(b) vertices that are adjacent in X(m) are mapped to buckets that are at most
distance 5 apart in the bucket tree.


The constants in the expression for $N(\ell)$ can be reduced by increasing
the  constant  5  in  part  (b)  of  the  Lemma  (say,  to  10).  We  suffer  the 
larger  constants  in  order  to  simplify  the  technical  development  in  the 
proof.  The  interested  reader  can  easily  mimic  our  development  with 
other  constants. 
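For a feel of the sizes involved, the bucket population of Lemma 2 can be evaluated level by level. This is a minimal sketch that assumes base-2 logarithms and leaves the value unrounded; `bucket_population` is a hypothetical name.

```python
import math


def bucket_population(level: int, m: int) -> float:
    """N(level) = 14 * log2((2^(m+1) - 1) / 2^level) + 24, as in Lemma 2."""
    n = 2 ** (m + 1) - 1  # number of vertices of X(m)
    # log2(n / 2^level) is computed as log2(n) - level.
    return 14 * (math.log2(n) - level) + 24


m = 10
for level in range(4):
    print(level, round(bucket_population(level, m), 2))
```

Each step down the bucket tree halves the associated subgraph, so the population drops by exactly 14 per level under this reading.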


Proof. The basic idea is to recursively bisect X(m), using a 5-color $\sqrt{2}$-bifurcator
(the uses of the colors will become clear momentarily), placing successively smaller
sets of $\sqrt{2}$-bifurcator vertices in lower-level buckets of the bucket tree. We also
place other vertices in the buckets, in order to ensure the desired “dilation” and
in order to ensure that all buckets are filled to capacity. The formal description
of the mapping will require two iterations. First, we present a mapping procedure
that establishes the sufficiency of the quantities $N(\ell)$ as bucket capacities. Then
we refine the initial mapping to complete the proof.


We simplify our description of this technically cumbersome procedure in two
ways. First, we describe in detail what the procedure would look like if we were
using 3-color bifurcators rather than 5-color bifurcators; the reader should be able
to extrapolate from our description to arbitrary numbers of colors. Second, we
establish the following notation.

• We denote by $B_\lambda$, where $\lambda$ denotes the null string (i.e., the string of length 0)
over the alphabet {1,2}, the bucket at the root of the bucket tree.

• In general, letting x denote any string over the alphabet {1,2}, we denote
by $B_{x1}$ and $B_{x2}$ the buckets at the children of the vertex of the bucket tree
having bucket $B_x$; for example, $B_1$ and $B_2$ denote the buckets at the children
of the root vertex of the bucket tree, $B_{11}$ and $B_{12}$ denote the buckets at the
left grandchildren of the root vertex, $B_{21}$ and $B_{22}$ denote the buckets at the
right grandchildren of the root vertex, and so on.
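The labeling scheme is the usual addressing of a binary tree by strings, here over the alphabet {1, 2}. A small sketch, with hypothetical helper names, lists the labels level by level.

```python
from itertools import product


def bucket_labels(level: int) -> list[str]:
    """Labels x in {1,2}^level of the buckets B_x at one level, left to right."""
    return ["".join(p) for p in product("12", repeat=level)]


def children(label: str) -> tuple[str, str]:
    """Labels of the buckets B_x1 and B_x2 below bucket B_x."""
    return label + "1", label + "2"


print(bucket_labels(0))  # [''] is the root bucket, labeled by the null string
print(bucket_labels(2))  # ['11', '12', '21', '22']
print(children("1"))     # ('11', '12'), the left grandchildren of the root
```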

Algorithm Bucket: Mapping X(m) into a bucket tree

Step 1. Initial coloring and bisection.

1.a. Initialize every vertex of X(m) to color $\lambda$.

1.b. Associate⁷ the graph X(m) with the root of the bucket tree.

1.c. Bisect X(m), to obtain subgraphs $X_1$ and $X_2$, and place the $\sqrt{2}$-bifurcator
vertices in bucket $B_\lambda$.

1.d. Recolor every $\lambda$-colored vertex of X(m) that is adjacent to a vertex in
bucket $B_\lambda$ with color 0.

1.e. Associate $X_i$ ($i \in \{1,2\}$) with the child of the root vertex of the bucket
tree holding bucket $B_i$.

Step 2. Second-level bisection.

2.a. Use a 2-color $\sqrt{2}$-bifurcator for each $X_i$, to create subgraphs $X_{i1}$ and
$X_{i2}$.

2.b. Place the $\sqrt{2}$-bifurcator vertices for each $X_i$ in the corresponding bucket
$B_i$ of the bucket tree.

2.c. Recolor every $\lambda$-colored vertex of X(m) that is adjacent to a vertex in
bucket $B_i$ with color 1.

2.d. For each $X_i$, associate each subgraph $X_{ij}$ with the $\sqrt{2}$-bifurcator-tree
vertex associated with bucket $B_{ij}$.

Step 3. Third-level bisection.

3.a. Use a 3-color $\sqrt{2}$-bifurcator for each $X_{ij}$, to create subgraphs $X_{ij1}$ and
$X_{ij2}$.

3.b. Place the $\sqrt{2}$-bifurcator vertices for each $X_{ij}$ in the corresponding bucket
$B_{ij}$ of the bucket tree.

3.c. Recolor every $\lambda$-colored vertex of X(m) that is adjacent to a vertex in
bucket $B_{ij}$ with color 0.

3.d. For each $X_{ij}$, associate each subgraph $X_{ijk}$ with the $\sqrt{2}$-bifurcator-tree
vertex associated with bucket $B_{ijk}$.

Step s. ($4 \le s \le m$) All remaining bisections.

s.a. For each subgraph $X_y$ ($y \in \{1,2\}^{s-1}$) of X(m) created in Step s − 1, place
every vertex of color s (mod 2) in the associated bucket $B_y$.

s.b. Use a 3-color $\sqrt{2}$-bifurcator for each $X_y$, to create subgraphs $X_{y1}$ and
$X_{y2}$.

s.c. Place the $\sqrt{2}$-bifurcator vertices for each $X_y$ in the corresponding bucket
$B_y$ of the bucket tree.

s.d. Recolor every $\lambda$-colored vertex of X(m) that is adjacent to a vertex in
bucket $B_y$ with color length(y) (mod 2).

s.e. For each $X_y$, associate each subgraph $X_{yi}$ with the $\sqrt{2}$-bifurcator-tree
vertex associated with bucket $B_{yi}$.

⁷The “associations” here are intended to make it easier for the reader to follow our description
of the mapping.

We now analyze the 5-color analogue of the described mapping, to show that it
satisfies the demands of Lemma 2, with the requirement of “exactly” $N(\ell)$ vertices
per level-$\ell$ bucket replaced by “no more than” $N(\ell)$ vertices per level-$\ell$ bucket, i.e.,
to show that our bucket capacities are big enough. Since the “dilation” condition
(b) is transparently enforced when certain colored vertices are automatically placed
in buckets (in Step s.a), it will suffice to establish that the populations of the
buckets are as indicated in the modified condition (a). This follows by the following
recurrence, wherein N(k) denotes the number of vertices of X(m) that get mapped
into a bucket at level k − 1 of the bucket tree.

$N(k) \le \left\lceil \frac{5}{32} N(k-5) \right\rceil + 10 \log\left(\frac{n}{2^{k-1}}\right)$

with initial conditions


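• $N(1) \le 2 \log n$

• $N(2) \le 4 \log\left(\frac{n}{2}\right)$

• $N(3) \le 6 \log\left(\frac{n}{4}\right)$

• $N(4) \le 8 \log\left(\frac{n}{8}\right)$

• $N(5) \le 10 \log\left(\frac{n}{16}\right)$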

The initial conditions reflect the sizes of the appropriately colored $\sqrt{2}$-bifurcators
of X(m): At each level $\ell$, $1 \le \ell \le 4$, one uses an $\ell$-colored $\sqrt{2}$-bifurcator, followed
by a 5-color $\sqrt{2}$-bifurcator at all subsequent levels. At levels s > 2, the buckets
contain not only $\sqrt{2}$-bifurcator vertices, which account for the term

$10 \log\left(\frac{n}{2^{k-1}}\right)$

in the general recurrence; they contain also the vertices of X(m) that are placed
to satisfy the “dilation” requirements. The latter vertices comprise all neighbors of
the N(k − 5) occupants of the distance-4 ancestor bucket that have not yet been
placed in any other bucket. Since vertices of X(m) can have no more than five
neighbors, and since our 5-color bisections allocate these neighbors equally among
the descendants of a given bucket, these “dilation”-generated vertices can be no
more than

$\frac{5}{32} N(k-5)$

in number. These two sources, the $\sqrt{2}$-bifurcators and their neighbors, account for
the occupants of the buckets and for the recurrence counting them. To complete
the proof of the modified Lemma, one now shows by standard techniques that the
indicated recurrence, with the indicated initial conditions, has the solution

$N(k) \le 14 \log\left(\frac{n}{2^{k-1}}\right) + 24$.
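The claimed solution can be sanity-checked numerically. The scanned recurrence is hard to read, so the sketch below works under an explicit assumption: that it has the form N(k) ≤ ⌈(5/32) N(k − 5)⌉ + 10 log(n/2^{k−1}), with initial conditions N(k) ≤ 2k log(n/2^{k−1}) for 1 ≤ k ≤ 5 and base-2 logarithms. It iterates that recurrence with equality and compares every value against the bound 14 log(n/2^{k−1}) + 24. The function name is hypothetical.

```python
import math


def check_bound(m: int) -> bool:
    """Iterate the assumed recurrence with equality and test the bound
    N(k) <= 14 * log2(n / 2^(k-1)) + 24 for every k from 1 to m + 1."""
    n = 2 ** (m + 1) - 1

    def log_size(k: int) -> float:
        # log2(n / 2^(k-1)): log-size of the subgraph handled at bucket level k - 1
        return math.log2(n) - (k - 1)

    # Assumed initial conditions: N(k) = 2k * log2(n / 2^(k-1)) for k = 1..5.
    N = {k: 2 * k * log_size(k) for k in range(1, 6)}
    for k in range(6, m + 2):
        # Assumed reading of the recurrence, taken with equality.
        N[k] = math.ceil(5 * N[k - 5] / 32) + 10 * log_size(k)
    return all(N[k] <= 14 * log_size(k) + 24 for k in N)


print(all(check_bound(m) for m in (10, 20, 40)))
```

The check is only evidence for the constants 14 and 24 under the assumed coefficient; it does not replace the inductive argument the text alludes to.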

Finally,  we  turn  to  the  original  form  of  the  Lemma.  This  follows  from  the 
modified  form,  upon  refining  the  Algorithm  by  adding  the  following  substeps  at  the 
indicated points.

At the end of each step of the Algorithm, when we have finished filling
a bucket $B_x$ ($x \in \{1,2\}^*$) with vertices obtained from a recent bisection
or from our desire to maintain small “dilation”, we check the population
of the bucket against the ceiling population $N(\ell)$, where $\ell = \mathrm{length}(x)$.

If the bucket contains fewer than $N(\ell)$ vertices, then we add enough
new vertices to it from the remaining associated subgraph to fill it to
capacity.

This last step ensures that all buckets at level $\ell$ of the bucket tree contain exactly
$N(\ell)$ vertices. □

Figure 7: Unloading the buckets

Our final task is to refine the “dilation”-5 mapping of Lemma 2 to a bona fide
embedding of X(m) in T(m), having dilation $O(\log\log n)$. We proceed inductively,
emptying buckets into T(m) in such a way that each tree vertex is assigned a
unique X-tree vertex. In general, we denote by $T_x$ the smallest subtree of T(m)
that is rooted at level length(x) of T(m) and that contains the contents of bucket
$B_x$. (In general, the contents of $B_x$ will occupy only the last few levels of $T_x$.) See
Fig. 7.

• Place the log n elements of bucket $B_\lambda$ in the topmost copy of T(log log n) in
T(m), in any way.

• Consider the subtrees of $T_\lambda$ rooted at level 1 of T(m). Place the contents of
bucket $B_1$ in the (roughly) log log n levels of the leftmore of these two subtrees,
starting immediately after the leaves of $T_\lambda$. Place the contents of bucket $B_2$
analogously, using the rightmore of these two subtrees, starting immediately
after the leaves of $T_\lambda$. We have thus implicitly defined the subtrees $T_1$ and $T_2$.

Note that by this point, we are using enough of the top levels of T(m) that we need
use only one more level in order to place the contents of the next level of buckets.
The importance of this fact is that it guarantees that all of the subtrees $T_x$ will
have height $O(\log\log n)$. (Namely, $T_\lambda$, $T_1$, and $T_2$ have the desired height, and all
subsequent trees will result from adding one level of leaves to a tree whose root is
one level lower in T(m) than was its father's root.)

• Proceeding inductively, assume that we have filled subtrees $T_x$ of T(m) with
bucket contents, for strings $x \in \{1,2\}^*$ of length $\le \ell$. We now consider
the subtrees of T(m) rooted at level $\ell + 1$; each subtree $T_x$ rooted at level
$\ell$ thus spawns two children. We order these $2^{\ell+1}$ subtrees from left to right,
according to the lexicographic order on the subscript-strings x. We then place
the contents of the bucket $B_{x1}$ in the leaves of the leftmore of the children of
$T_x$, beginning where the contents of bucket $B_x$ left off. Analogously, we place
the contents of the bucket $B_{x2}$ in the leaves of the rightmore of the children
of $T_x$, beginning where the contents of bucket $B_x$ left off.

The described procedure clearly produces an embedding of X(m) in T(m), since
each vertex of X(m) is assigned to a unique tree vertex. Additionally, the
embedding has unit expansion since no tree vertices are passed over in the
assignment process and since all buckets at each level i have the same population N(i)
(so all subtrees T_x are isomorphic). Finally, the procedure's method of spreading
bucket contents throughout T(m) produces an embedding with the desired dilation,
namely, O(log log n). Specifically, by always spreading the contents of buckets B_x1
and B_x2 in the leaves of the left and right subtrees of the depth-O(log log n) subtree
that contains the contents of bucket B_x, the procedure guarantees that the least
common ancestor, in T(m), of the set comprising the contents of any bucket plus
the vertices in buckets at most five buckets up (which will lie in adjacent levels
k + 1, k + 2, k + 3, k + 4, k + 5 of the bucket tree) is always within a subtree
of height O(log log n) of T(m). Thus, we have produced the desired embedding,
thereby proving Proposition 2, hence Theorem 2. □


4.  LOWER  BOUNDS  -  THEOREM  3 

We  demonstrate  the  near-optimality  (to  within  constant  factors)  of  the  embeddings 
of  Section  3  -  in  fact,  true  optimality  for  X-trees  -  by  proving  the  lower  bound  of 
Theorem 3. In contrast with Theorem 2, Theorem 3 is most easily proved in its full
generality. 

Assume henceforth that we are given a planar graph G, a planar embedding ε
of G, and a minimum-dilation embedding μ of G in B(p); let μ have dilation δ.

We  begin  by  noting  that  we  can  simplify  our  quest  somewhat.  Specifically,  since 
we  aim  only  for  bounds  that  hold  up  to  constant  factors,  we  lose  no  generality  by 
assuming henceforth that (in the embedding ε) the exterior face of G is a simple
cycle: 

Lemma 3 One can add edges to the graph G within the embedding ε in such a way
that

•  the resulting embedding ε' is a planar embedding of the resulting graph G'

•  in the embedding ε', the exterior face of G' is a simple cycle

•  Σ(G') = O(Σ(G))

•  Φ_ε'(G') = max(3, Φ_ε(G)).

Proof  Sketch.  If  the  exterior  face  of  G  is  not  a  simple  cycle,  it  is  because  of  cut- 
edges  and/or  pinch-vertices.  We  take  each  cut-edge  in  turn  and  create  a  triangle 
containing  it  as  an  edge;  then  we  repeat  the  process  with  any  remaining  cut-edge. 
When  no  more  cut-edges  exist,  we  eliminate  each  pinch-vertex  in  turn  by  creating 
a  triangle  that  includes  the  pinch-vertex  as  a  vertex.  Since  each  added  edge  creates 
a triangle and spans only two edges of G, the claims about Φ(G') and Σ(G') are
immediate.  □ 

A  consequence  of  Lemma  3  is  that  we  may  henceforth  assume  that  every  edge 
of G resides in some interior face (in the embedding ε).

We  turn  now  to  the  quantitative  consequences  of  Lemma  3. 

A set of faces of G is connected in the embedding ε just when their corresponding
vertices are connected in the graph Γ(G; ε) whose vertices are the faces of G and
whose edges connect a pair of face-vertices just when the faces share a vertex. A
set S of vertices of G is face-connected (in ε) if the set of interior faces of G that
contain one or more of the vertices of S is connected.

Let A be a connected component of the graph G remaining after removing a
set S of vertices from G. The S-boundary of A is the set of vertices of A that are
adjacent (in G) to vertices of S.

Lemma 4 If one removes a face-connected set of vertices S from the graph G,
then the S-boundary of every resulting maximal connected component of G is
face-connected.

Proof. Consider a maximal connected component A remaining after removing S
from G. Assume for contradiction that the set of S-boundary vertices of A is
not face-connected. There must then be at least two distinct maximal connected
components, call them F_1 and F_2, of interior faces that contain boundary vertices
(so F_1 ∪ F_2 is not connected). Let f_i, i = 1, 2, be an interior face in component F_i,
and let b_i be a boundary vertex in face f_i. Since each edge of G lies in an interior
face, we can choose each f_i to contain a vertex of S as well as a boundary vertex.

Fact 1 There is a connected set I of interior faces, none of which contains a boundary
vertex, such that I separates f_1 from f_2.

Verification. It is not possible for both F_1 to encircle f_2 and F_2 to encircle f_1, since
then F_1 and F_2 would intersect (so f_1 and f_2 would be connected by interior faces
containing boundary vertices). Without loss of generality, say that F_1 does not
encircle f_2.

Let J be the set of interior faces that do not contain boundary vertices and that
are incident to the outer boundary of F_1 (so that f_2 is on the outside). By definition,
the set J separates f_1 from f_2. If J is connected, then it is the desired set I. If J
is not connected, then adding the exterior face of G to J yields a connected set J'.
Moreover, f_2 must lie in one of the simply connected regions J'' of J'. Deleting the
exterior face from J'' then yields the desired set I; see Fig. 8.

Fact 2 I contains a vertex of A and a vertex of S.

Verification. I separates f_1 from f_2, yet: f_1 and f_2 both contain vertices of both A
and S; both A and S are face-connected in G.

Since I contains vertices of A and S and is connected, and since S separates the
connected set A from the rest of G, the set I must contain at least one face that
contains both a vertex from A and a vertex from S. Such a face must also contain
a vertex of the S-boundary of A, contradicting Fact 1. Lemma 4 follows. □

A set of vertices S of a graph K is d-quasi-connected, d a positive integer, if for
every two vertices u, w of S, there exists a chain of vertices

u = v_0, v_1, v_2, ..., v_k = w

of S, where consecutive vertices v_i, v_{i+1} are distance ≤ d apart in K.
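
The definition of d-quasi-connectedness can be made concrete with a short Python sketch. It assumes the host graph is given as an adjacency dictionary and reads the distance bound as "at most d"; the function names are hypothetical and chosen only for illustration.

```python
from collections import deque

def bfs_distances(adj, source, limit):
    """Return hop distances from `source` to every vertex within `limit` hops."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        if dist[u] == limit:
            continue  # do not explore beyond the distance limit
        for w in adj[u]:
            if w not in dist:
                dist[w] = dist[u] + 1
                queue.append(w)
    return dist

def is_quasi_connected(adj, S, d):
    """True if any two vertices of S are linked by a chain of S-vertices
    whose consecutive members are at distance <= d in the graph `adj`."""
    S = set(S)
    if not S:
        return True
    start = next(iter(S))
    reached = {start}
    queue = deque([start])
    while queue:
        u = queue.popleft()
        for w in bfs_distances(adj, u, d):
            if w in S and w not in reached:
                reached.add(w)
                queue.append(w)
    return reached == S

# A path 0 - 1 - ... - 6 as a toy host graph.
path = {i: [j for j in (i - 1, i + 1) if 0 <= j <= 6] for i in range(7)}
print(is_quasi_connected(path, {0, 2, 4}, 2))
print(is_quasi_connected(path, {0, 2, 4}, 1))
```

On the path, the set {0, 2, 4} is 2-quasi-connected but not 1-quasi-connected, since consecutive members are exactly two hops apart.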


Figure 8: (a) The embedding ε, with the set J outlined boldly; “x” marks f_2, and
“X” marks F_2 with the holes filled in. (b) The set J' = J ∪ (outer face). (c) The
embedding ε, as in (a), with the set J'' outlined boldly.

Lemma 5 The Fundamental Lemma for Butterfly-Like Graphs
Say that there is a subgraph K of G and a constant c such that

•  K is Φ(G)-quasi-connected

•  the image of K under the embedding μ lies within cΦ(G)δ consecutive levels
of B(p).

Then δ ≥ α(c) · log |K| / Φ(G), where α(c) is a constant depending only on c.

Proof. Say that the image of K under μ lies entirely in levels

ℓ + 1, ℓ + 2, ..., ℓ + cΦ(G)δ

of B(p) (all addition is modulo p). Let u and v be arbitrary vertices of K which are
connected by a path of at most Φ(G) vertices in G. The image of this path in B(p)
must lie totally within levels

ℓ − Φ(G)δ + 1, ..., ℓ + (c + 1)Φ(G)δ

of B(p), since the embedding μ has dilation δ. See Fig. 9. Since K is Φ(G)-quasi-connected,
this means that the PWL strings of all images of vertices of K can differ
only in some set of at most ((c + 2)Φ(G) + 1)δ bit positions. It follows that K
can contain no more than cΦ(G)δ · 2^(((c+2)Φ(G)+1)δ) vertices, i.e., cΦ(G)δ levels of B(p),
with at most 2^(((c+2)Φ(G)+1)δ) vertices per level. In other words,

cΦ(G)δ · 2^(((c+2)Φ(G)+1)δ) ≥ |K|,

whence the result. □

Figure 9: Illustrating the Fundamental Lemma, with t = cΦ(G)δ and h = Φ(G)δ:
Vertices of K reside in region II; length-Φ(G) paths between vertices of K cannot
extend beyond regions I or III.

We  now  complete  the  proof  of  Theorem  3,  beginning  with  two  simple  lemmas. 

Lemma 6 Any face-connected set of vertices of G is Φ(G)-quasi-connected.

Proof Sketch. Any vertex in an f-vertex face is distance ≤ f/2 from any neighboring
face. □

Lemma 7 Let C be a set of vertices of the graph G whose removal partitions G
into connected components all of size ≤ |G|/2. Then C is a 1/3-2/3 separator of G.

Proof  Sketch.  Remove  C  from  G,  order  the  resulting  connected  components  by  size 
into  decreasing  order,  and  lump  the  components  into  two  piles  as  follows. 

•  Place  the  largest  component  into  the  left  pile. 

•  Place  as  few  of  the  largest  remaining  components  in  the  right  pile  as  possible 
until  the  right  pile  is  bigger  than  the  left. 

•  Now alternate piles, adding as few of the largest remaining components as possible
to the smaller pile until the smaller first becomes bigger than the larger pile.

Clearly, when one has completed the two piles, the larger cannot be bigger than the
smaller by more than the size of the third largest component, i.e., by more than
|G|/3 vertices. It follows that each pile must contain at least |G|/3 vertices, whence
the claim. □
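
The balancing idea behind Lemma 7 can be checked numerically. The Python sketch below deliberately substitutes a simpler greedy rule for the alternating procedure of the proof sketch: components are taken in decreasing order of size and each goes to whichever pile is currently smaller. The function name and the sample sizes are assumptions for illustration, and the totals count component vertices only.

```python
def two_piles(component_sizes):
    """Greedily split component sizes into two piles, largest first,
    always adding to the pile that is currently smaller."""
    left, right = [], []
    for size in sorted(component_sizes, reverse=True):
        smaller = left if sum(left) <= sum(right) else right
        smaller.append(size)
    return left, right

sizes = [5, 4, 3, 2, 1, 1]          # every component has at most half the total
left, right = two_piles(sizes)
total = sum(sizes)
print(sum(left), sum(right), total)
# Each pile holds at least one third of the component vertices.
assert 3 * min(sum(left), sum(right)) >= total
```

With this rule, the final imbalance is bounded by the size of the last component added to the larger pile, which is what keeps each pile at or above one third of the total when no component exceeds half.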

Theorem  3  will  now  follow  from  the  next  Lemma. 

Lemma 8 The embedding μ must have dilation δ ≥ (const) · log Σ(G) / Φ(G).

Proof. Partition B(p) into bands, each band β_i being a sequence of d_i δ consecutive
levels, 2Φ(G) ≤ d_i ≤ 4Φ(G), where the constants d_i may be chosen in any way that
achieves a partition. Let κ(v), the color of vertex v of G, be the index i of the band
β_i in which μ(v) resides.

We perform a modified breadth-first search of G, to find a Φ(G)-quasi-connected
component of size ≥ Σ(G), all of whose vertices have images in a single band of
B(p), hence the same color. By Lemma 5, the existence of such a component will
yield the lower bound on δ.

The breadth-first search proceeds as follows. We select an arbitrary vertex v_0
of G and form V_0, the maximal connected component of G that contains v_0 and
that consists entirely of vertices with color κ(v_0). Since V_0 is connected, removing
its vertices partitions G into connected components; let C_0 be the largest of these.
Lemmas 4 and 6 assure us that the V_0-boundary, B_0, of the component C_0 is
Φ(G)-quasi-connected. It follows that

Fact 3 All vertices of B_0 have the same color.

Verification. Since each v ∈ B_0 is adjacent to a vertex of V_0, we must have κ(v) ∈
{κ(v_0) − 1, κ(v_0) + 1}. Moreover, B_0 cannot contain vertices of both colors: Two
such vertices would be separated by the band β_κ(v_0), contradicting the fact that B_0
is Φ(G)-quasi-connected.
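
The first round of the search (forming V_0, C_0, and B_0) can be sketched in Python. The adjacency-dictionary representation, the `color` map standing in for the band index of each vertex, and the function names are all assumptions made for illustration; later rounds, which grow V_i from B_{i-1}, are not shown.

```python
from collections import deque

def component(adj, start, allowed):
    """Vertices reachable from `start` through vertices w with allowed(w)."""
    seen, queue = {start}, deque([start])
    while queue:
        u = queue.popleft()
        for w in adj[u]:
            if w not in seen and allowed(w):
                seen.add(w)
                queue.append(w)
    return seen

def search_step(adj, color, v0):
    """Return (V, C, B): the monochromatic component of v0, the largest
    component left after removing V, and the V-boundary of that component."""
    V = component(adj, v0, lambda w: color[w] == color[v0])
    rest, comps = set(adj) - V, []
    while rest:
        comp = component(adj, next(iter(rest)), lambda w: w not in V)
        comps.append(comp)
        rest -= comp
    C = max(comps, key=len) if comps else set()
    B = {u for u in C if any(w in V for w in adj[u])}
    return V, C, B

# Toy example: a path 0 - 1 - ... - 6 whose vertices fall into three bands.
adj = {i: [j for j in (i - 1, i + 1) if 0 <= j <= 6] for i in range(7)}
color = {0: 0, 1: 1, 2: 1, 3: 1, 4: 2, 5: 2, 6: 2}
V, C, B = search_step(adj, color, 2)
print(V, C, B)
```

In the toy example the boundary consists of a single vertex whose color is one more than that of v_0, in line with Fact 3.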

Next, form V_1, the maximal monochromatic subgraph of G that contains both B_0
and all connected components of G that intersect B_0; obviously, V_1 is
Φ(G)-quasi-connected, so removing it partitions G into some number of connected components.
Let C_1 be the largest of these, and let B_1 be the V_1-boundary of C_1. As with B_0,
one shows that B_1 is Φ(G)-quasi-connected and monochromatic.

We continue in this fashion, constructing, in turn, for i = 2, 3, ..., the following
subgraphs of G, with the indicated properties:

•  V_i: the (Φ(G)-quasi-connected) maximal monochromatic subgraph of G that
contains both B_{i-1} and all connected components of G that intersect B_{i-1}

•  C_i: the largest connected component of G remaining when one removes V_i
from G

•  B_i: the (Φ(G)-quasi-connected, monochromatic) V_i-boundary of C_i

One continues this construction until some subgraph V_i contains at least Σ(G)
vertices. We now show that this point must occur.

Fact 4 For some i, |V_i| ≥ Σ(G).

Verification. Note that at each point in our construction, V_i is whittled out of the
largest component C_{i-1} of G remaining after removal of V_{i-1} from G. Moreover, V_i
disconnects the vertices of C_{i-1} − V_i from the remainder of G, as one can verify easily
by induction on i. At some point, therefore, the whittling process must reduce the
size of the then-current largest component C_m so that |C_m| ≤ |G|/2. By Lemma 7,
the then-current V_m is a 1/3-2/3 separator of G, hence must contain at least Σ(G)
vertices.

The preceding development gives us a set of vertices, of size ≥ Σ(G), whose
images reside in a single band of levels of B(p). By Lemma 5, Theorem 3 follows.

□ 

5.  THE  COROLLARIES:  X-TREES  AND  MESHES 

Corollaries  1  and  2  now  follow  from  the  following  Lemmas. 

Lemma 9 ([HR]) Σ(X(h)) = Ω(h) = Ω(log |X(h)|), and Φ(X(h)) = 5 (under the
natural embedding).

Lemma 10 (e.g., [HR]) Σ(M(s)) = Ω(s) = Ω(√|M(s)|), and Φ(M(s)) = 4 (under
the natural embedding).

6.  CONCLUDING  REMARKS 

We  close  with  some  remarks  about  extensions  to  the  research  described  here. 

The  lower  bound  of  Theorem  3  cannot  be  improved  in  general,  as  one  can  see 
from  considering  homeomorphs  of  the  mesh. 

Our  lower  bound  for  the  mesh  extends  also  to  higher-dimensional  meshes  and 
to  pyramid  graphs;  thus,  these  are  examples  of  other  popular  networks  that  embed 
efficiently  in  the  Hypercube,  but  not  in  butterfly-like  machines. 

The  lower  bound  of  Theorem  3,  which  deals  explicitly  only  with  embeddings  in 
the  Butterfly,  extends  to  embeddings  in  the  mesh  of  trees,  Cube-Connected-Cycles, 
Benes  network,  and  similar  levelled  networks. 

We do not yet have an analogue of Theorem 3 for embeddings in the shuffle-exchange
and deBruijn graphs. However, using rather complicated arguments, we
can prove that any expansion-O(1) embedding of the n-vertex X-tree or the n-vertex
mesh in these host graphs requires dilation Ω(log log n). Since a complete binary
tree is a spanning tree of the deBruijn graph, the proof technique of Section 3 shows
that this lower bound for the X-tree is optimal. We suspect that the lower bound
for the mesh can be improved.

In  order  to  justify  dilation  fully  as  the  central  measure  of  concern  in  network 
embeddings,  it  would  be  nice  to  strengthen  the  results  of  Section  3  to  show  that 
the Butterfly can simulate any graph having a √2-bifurcator of size S = Ω(log n)
with  delay  O(logS).  We  believe  this  to  be  possible  using  the  arguments  of  Section 
2,  but  we  have  not  worked  through  the  details. 

Lastly, it should be noted that our lower bounds do not mean that a Butterfly
cannot efficiently simulate a mesh or X-tree over a large span of time. For
example, a Butterfly can simulate log n steps of a mesh of a constant fraction smaller
size within O(log n log log n) steps, and possibly within O(log n) steps. Similar
improvements  in  amortized  simulation  times  are  also  possible  for  the  X-tree,  and 
we  are  currently  studying  how  good  such  amortized  simulations  can  be  in  general. 

Abstract

The  Message  Driven  Processor  is  a  node  of  a  large-scale  multiprocessor  being 
developed by the Concurrent VLSI Architecture Group. It is intended to support
fine-grained, message-passing, parallel computation. It contains several novel architectural
features,  such  as  a  low-latency  network  interface,  extensive  type-checking  hardware, 
and  on-chip  memory  that  can  be  used  as  an  associative  lookup  table. 

This  document  is  a  programmer’s  guide  to  the  MDP.  It  describes  the  processor’s 
register  architecture,  instruction  set,  and  the  data  types  supported  by  the  processor.  It 
also  details  the  MDP’s  message  sending  and  exception  handling  facilities. 

OBJECT-ORIENTED CONCURRENT PROGRAMMING IN CST

Abstract

CST  is  a  programming  language  based  on  Smalltalk-80  that  supports  concurrency  using 
locks,  asynchronous  messages,  and  distributed  objects.  Distributed  objects  have  their 
state  distributed  across  many  nodes  of  a  machine,  but  are  referred  to  by  a  single  name. 
Distributed  objects  are  capable  of  processing  many  messages  simultaneously  and  can 
be  used  to  efficiently  connect  together  large  collections  of  objects.  They  can  be  used  to 
construct  a  number  of  useful  abstractions  for  concurrency.  This  paper  describes  the 
CST  language,  gives  examples  of  its  use,  and  discusses  an  initial  implementation. 

Abstract

This paper describes micro-optimization, a technique for reducing the operation count
and time required to perform floating-point calculations. Micro-optimization involves
breaking floating-point operations into their constituent micro-operations and optimizing
the resulting code. Exposing micro-operations to the compiler creates many
opportunities for optimization. Redundant normalization operations can be eliminated or
combined. Also, scheduling micro-operations separately allows dependent
operations to be partially overlapped. A prototype expression compiler has been written
to evaluate a number of micro-optimizations. On a set of benchmark expressions,
operation count is reduced by 33% and execution time is reduced by 40%.

1  Introduction 

Many  unneeded  operations  are  performed  during  the  evaluation  of  floating  point  expressions 
because  existing  compilers  and  floating  point  units  consider  these  operations  to  be  atomic.  By 
decomposing  floating  point  operations  into  their  constituent  integer  micro-operations,  many 
opportunities  for  optimization  are  exposed.  Redundant  shift  operations  may  be  eliminated, 
parts  of  the  computation  may  be  done  with  a  block  exponent,  common  subexpressions  in  the 
mantissa  or  exponent  calculation  are  exposed,  and  additional  flexibility  in  scheduling  operations 
is  possible. 

This  paper  describes  methods  for  micro-optimizing  floating  point  expressions.  Each  operation  in 
the expression is decomposed into its primitive integer micro-operations. For example, a floating
point add is decomposed into an exponent subtract, mantissa alignment, mantissa add, leading
zeros count, exponent adjust, and mantissa normalization. Optimizations are performed on
the  resulting  micro-operations.  For  example,  a  normalizing  left  shift  from  one  FP  add  may  be 
combined  with  the  aligning  right  shift  of  a  subsequent  FP  add  resulting  in  a  single  shift.  The 
entire  expression  is  scheduled  as  a  unit  resulting  in  better  hardware  utilization. 

On a set of benchmark expressions, micro-optimization reduces operation count by 33% and
execution time by 40% compared to conventional floating point execution with identical
function unit performance and register bandwidth.

... Agency under contracts N00014-80-C-0622 and N00014-85-K-0124 and in part by a National Science
Foundation Presidential Young Investigator Award with matching funds from General Electric
Corporation and IBM Corporation.

To fully exploit micro-optimization, a micro floating point unit (μFPU) is required. The
instruction set of a μFPU consists of the micro-operations required for floating point arithmetic
(e.g., alignment shifts that maintain guard, round, and sticky bits). These operations are
performed out of a set of mantissa and exponent registers. By providing the appropriate primitive
operations, no compromises are made in terms of accuracy, rounding, adherence to standards,
and performance.

This  work  is  motivated  by  recent  progress  on  RISC  [8]  and  VLIW  [3]  architectures.  RISC 
machines eliminate the complex addressing modes found in CISC machines [9]. Address
calculations are performed using integer arithmetic instructions rather than by microcode or special
hardware. Exposing these calculations to the compiler often improves performance.
Micro-optimization applies this technique to floating point operations. As with address calculations,
breaking  these  operations  into  their  primitive  components  has  the  disadvantage  of  decreasing 
code  density  and  increasing  instruction  bandwidth. 

Micro-optimization  borrows  from  VLIW  technology,  in  that  several  micro-operations  may  be 
performed  simultaneously.  Also,  some  of  the  optimizations  described  here  schedule  code  across 
basic  blocks.  However,  the  technique  used  is  different  from  trace  scheduling. 

The  idea  of  using  a  compiler  to  optimize  a  function  normally  considered  a  primitive  arithmetic 
operation  has  been  applied  to  integer  multiplication  by  a  constant  [5]. 

The  next  section  illustrates  the  basic  concepts  of  micro-optimization  by  means  of  a  few  simple 
examples.  A  prototype  expression  compiler  written  to  test  these  concepts  is  described  in  Section 
3. Section 4 describes the architecture of an exemplary μFPU. The compiler and μFPU are
evaluated on a number of benchmark programs in Section 5.


2  Micro-Optimizations 


This section illustrates micro-optimizations by means of examples given in μFP assembly code
(see Section 4). The code for a single add (A = B + C) and a single multiply (A = B * C)
is shown below. The subtract operation is similar to add. The optimizations start from
concatenations of these sequences and perform transformations to reduce the number of
micro-operations.
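
To make the decomposition concrete, here is a minimal Python sketch of a floating point add written as the six micro-operations named in the Introduction. It is not μFP assembly code: the toy format (value = mantissa × 2^exponent with a 24-bit normalized mantissa), the restriction to positive operands, truncation in place of guard/round/sticky handling, and all names are assumptions made for illustration.

```python
MANT_BITS = 24  # assumed mantissa width of the toy format

def fp_add_micro(ea, ma, eb, mb):
    """Add two positive toy floats (value = m * 2**e), one micro-op per step."""
    d = ea - eb                           # 1. exponent subtract
    if d >= 0:                            #    branch on the sign of the difference
        e, mb = ea, mb >> d               # 2. mantissa alignment (right shift)
    else:
        e, ma = eb, ma >> -d
    m = ma + mb                           # 3. mantissa add
    shift = MANT_BITS - m.bit_length()    # 4. leading-zeros count (negative on carry-out)
    e -= shift                            # 5. exponent adjust
    m = m << shift if shift >= 0 else m >> -shift   # 6. mantissa normalization
    return e, m

def value(e, m):
    """Convert a toy float back to a Python float for inspection."""
    return m * 2.0 ** e

# 1.5 + 2.25: mantissas are 24-bit normalized integers.
e, m = fp_add_micro(-23, 3 << 22, -22, 9 << 20)
print(e, m, value(e, m))
```

Steps 4 through 6 are the normalization work that several of the optimizations in this section aim to remove or merge.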

In this section optimizations will be evaluated by comparing the path lengths of the optimized
and unoptimized μFP code. Timings for different micro-operations will be discussed in Section
4.

Three  instructions,  at  least  half  the  total,  in  each  sequence  are  used  to  normalize  the  result. 
Many of the optimizations described below are methods to eliminate unnecessary
normalizations.


Automatic  Block  Exponent 

The  alignment  operations  of  cascaded  additions  can  be  simplified  if  the  largest  exponent  is 
identified  and  used  as  a  block  exponent  for  the  additions.  All  mantissas  are  aligned  using  this 
exponent  and  added  without  normalization.  Only  the  final  sum  is  normalized. 

The following code shows an application of this technique to the expression (T0 = A + B +
C). Only the code for the case where A has the largest exponent is shown. By eliminating the
normalization and realignment of the intermediate result, this path through the sum has been
shortened.

The use of automatic block exponent requires that extra mantissa bits to the left of the binary
point be maintained in case the adds result in an increased exponent. If n adds are performed
in sequence, log2 n extra bits must be maintained.

In some cases, the use of an automatic block exponent can increase rounding errors. In the
above example, if A ≈ −B and |C| << |A|, the intermediate result is badly undernormalized
and valuable bits of C will be lost when it is aligned with the original exponent. The effect is
the same as if the addition were performed in the order (A + C + B). This technique treats
floating point addition as if it were associative and commutative and has the same effect as
reordering the additions to give the largest possible rounding error.

Even with these limitations, automatic block exponent is a very effective optimization. Many
computations include long sequences of adds (e.g., dot products) where operand ordering is
not critical. In these cases, the use of a block exponent reduces the path length from 6n to
3n + 3, a savings approaching 50% for large n.
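
A minimal Python sketch of the block exponent idea follows. It assumes a toy format (value = mantissa × 2^exponent, 24-bit normalized mantissa), positive operands, and truncation on alignment; Python's unbounded integers stand in for the extra mantissa bits to the left of the binary point. The function name and format are hypothetical, not part of the μFP design.

```python
def block_exponent_sum(terms, mant_bits=24):
    """Sum positive toy floats given as (exponent, mantissa) pairs,
    aligning every mantissa to the largest exponent and normalizing once."""
    block_e = max(e for e, _ in terms)                   # the block exponent
    total = sum(m >> (block_e - e) for e, m in terms)    # align and add, no normalization
    shift = mant_bits - total.bit_length()               # one leading-zeros count
    e = block_e - shift                                  # one exponent adjust
    m = total << shift if shift >= 0 else total >> -shift  # one normalization shift
    return e, m

# 1.5 + 2.25 + 1.5 in the toy format.
e, m = block_exponent_sum([(-23, 3 << 22), (-22, 9 << 20), (-23, 3 << 22)])
print(e, m, m * 2.0 ** e)
```

The three normalization steps appear once at the end rather than once per add, which is where the 3n + 3 path length comes from.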

Shift Combining

Shift  combining  is  an  alternative  to  automatic  block  exponent  that  can  be  used  in  cases  where 
the  order  of  the  operations  must  be  preserved.  When  adding  three  or  more  floating  point 
numbers,  redundant  shifts  may  be  performed  when  a  mantissa  is  shifted  left  for  normalization 
and  then  immediately  shifted  right  for  alignment.  To  recognize  redundant  shifts,  the  mantissa 
left  shift  in  the  first  add  is  moved  below  the  branch  of  the  second  add.  This  requires  copying 
the  shift  into  both  paths  of  the  branch.  The  shift  will  be  eliminated  in  one  of  the  two  paths. 

The following code fragment, taken from the compilation of A + B + C, illustrates this
technique. The fragment begins after the B and C mantissas have already been aligned and added.
It  ends  after  the  final  mantissa  sum  is  computed  but  before  the  normalization. 

If the branch is not taken, the shift is combined with the alignment right shift. An
additional exponent subtract
is  required  to  calculate  the  shift  count.  If  the  branch  is  taken,  the  shifts  operate  on  different 
mantissas  and  cannot  be  combined.  The  path  length  of  the  optimized  code  is  unchanged,  but 
an  expensive  mantissa  shift  is  replaced  with  an  inexpensive  exponent  subtract. 


Post  Multiply  Normalization 

A  multiply  operation  can  denormalize  its  result  by  at  most  one  bit  position.  If  a  few  extra 
guard  bits  to  the  right  of  the  mantissa  are  maintained,  the  results  of  multiplication  can  be  used 
without  normalization  with  no  loss  of  accuracy.  Only  the  final  result  must  be  normalized.  For 
example,  the  code  for  A  *  B  *  C  is  shown  below. 

This optimization also handles the ubiquitous case of multiply-add. If a multiply is followed by
an  add,  its  normalization  can  be  eliminated  as  the  final  result  will  be  normalized  by  the  add. 

For a sequence of multiplies, this optimization reduces the number of instructions from 5n to
2n + 3, a savings approaching 60% for large n. The savings in terms of time is somewhat less
since the mantissa multiply M* is an extremely costly operation.
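
The quoted savings are limits for long operation chains, as a quick calculation shows. The Python sketch below simply evaluates the path-length formulas given in the text (6n versus 3n + 3 for cascaded adds, 5n versus 2n + 3 for cascaded multiplies); the helper name is an assumption.

```python
def savings(unoptimized, optimized):
    """Fraction of the path length removed by an optimization."""
    return 1 - optimized / unoptimized

for n in (1, 2, 4, 16, 256):
    adds = savings(6 * n, 3 * n + 3)     # automatic block exponent
    muls = savings(5 * n, 2 * n + 3)     # post multiply normalization
    print(f"n={n:3d}  adds: {adds:6.1%}  multiplies: {muls:6.1%}")
```

For a single operation (n = 1) neither optimization saves anything, and the savings grow toward 50% and 60% respectively as n increases.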


Conventional  Optimizations 

Decomposing  floating-point  operations  exposes  the  resulting  micro-operations  to  conventional 
compiler  optimizations  such  as  constant  folding,  common  subexpression  elimination,  and  dead 
code elimination. Consider, for example, the expression (A + B)*(A - B). When reduced to
micro-operations  the  alignment  of  A  and  B  can  be  recognized  as  a  common  subexpression  and 
eliminated.  The  optimization  reduces  the  path  length  from  17  to  15,  a  12%  improvement.  A 
source  level  compiler  can  find  no  common  subexpressions  and  will  perform  the  alignment  twice. 


Scheduling 

More  efficient  use  of  floating  point  hardware  can  be  made  by  scheduling  the  micro-operations 
of an entire floating-point expression as a unit rather than scheduling each add or multiply
separately. The μops of one floating point operation can be used to fill idle cycles in the
evaluation  of  other  floating  point  operations  even  if  there  are  dependencies  between  the  two 
operations. 

Consider  for  example  the  case  of  a  multiply-add  (A  *  B  +  C).  A  reservation  table  for  this 
operation  is  shown  below.  Once  the  exponent  addition  for  the  multiply  is  completed  (A),  the 
exponent  subtract  for  the  add  may  be  performed  (C).  If  EA  +  EB  >  EC,  the  alignment  shift  for 
the  add  (D)  may  also  be  performed  in  parallel  with  the  multiply  (B).  In  a  conventional  floating 
point  unit,  the  multiply  has  to  complete  before  any  part  of  the  add  can  be  performed. 


3 The Micro-Optimizer

An  experimental  micro-optimizer  has  been  implemented  to  evaluate  the  optimizations  described 
above.  The  program  accepts  a  restricted  LISP  expression  as  input  and  produces  optimized 
μFPU assembly code as output.

The compilation is performed in the following steps:

1.  The  expression  is  compiled  into  standard  three  address  macro  floating-point  assembly 
code. 

2.  A  data  flow  graph  is  constructed  and  used  to  recognize  (1)  sequences  of  cascaded  additions 
and  (2)  non-terminal  multiplies. 

3. With the aid of the data flow graph, the macro assembly code is translated into μFP code.
Automatic block exponent and post multiply normalization optimizations are performed
during this step.

4.  Shift  combining  is  performed  by  checking  each  shift  to  determine  if  its  result  is  used  as 
input  to  another  shift. 

5. A control flow graph is constructed and each statement is labeled with an identifier
specifying the paths that pass through that statement.

6.  With  the  aid  of  the  control  flow  graph,  common  subexpression  elimination  is  performed. 
Expressions  are  eliminated  outside  of  basic  blocks  if  they  are  labeled  with  the  same  path 
identifier. 

7. The optimized μFP code is scheduled into horizontal microinstructions using a greedy
algorithm that schedules an operation as soon as its inputs and required resources are
available.
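
Step 4 (shift combining) can be sketched as follows. This is an illustrative sketch under stated assumptions, not the micro-optimizer's actual code: the instruction tuples `(op, dst, a, b)`, the single-assignment register names, and the use of one signed `shr` amount (negative meaning a left shift) are all hypothetical, and the sketch treats shifts as exact, ignoring discarded bits and the sticky bit.

```python
def combine_shifts(code):
    """Merge a shift whose input is the result of another shift.

    code: list of (op, dst, a, b). For 'shr', a is the source register and
    b is a signed integer shift amount. Each dst is assumed written once.
    """
    producer = {dst: i for i, (_, dst, _, _) in enumerate(code)}
    uses = {}
    for _, _, a, b in code:
        for operand in (a, b):
            if isinstance(operand, str):            # count register uses only
                uses[operand] = uses.get(operand, 0) + 1

    out = list(code)
    dead = set()
    for i, (op, dst, a, b) in enumerate(out):
        j = producer.get(a)
        if op == "shr" and j is not None and out[j][0] == "shr":
            _, _, inner_src, inner_amt = out[j]
            out[i] = ("shr", dst, inner_src, inner_amt + b)   # one shift by the sum
            if uses[a] == 1:                                  # intermediate no longer needed
                dead.add(j)
    return [ins for k, ins in enumerate(out) if k not in dead]


code = [
    ("shr", "t1", "m", 3),       # t1 <- m >> 3
    ("shr", "t2", "t1", -1),     # t2 <- t1 << 1
    ("madd", "t3", "t2", "n"),   # t3 <- t2 + n
]
print(combine_shifts(code))
```

The two shifts collapse into a single `shr` of `m` by 2, and the intermediate `t1` disappears because nothing else reads it.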


4  A  Micro  Floating  Point  Unit 

A micro floating point unit (μFP) is required to efficiently execute the code produced by
the micro-optimizer. Micro-optimization reduces floating point operations to their constituent
integer operations; however, an integer processor does not support features such as sticky bits
that are required to round according to existing standards [2]. This section describes the
architecture of a μFP suitable to execute the code described above. The purpose of this design
is to serve as a basis for the evaluation made in Section 5. This description is a paper design;
no μFPU has been constructed.

The μFP contains a 31-word by 12-bit exponent register file and a 31-word by 64-bit mantissa
register file. Each register file has two read ports and a single write port. The exponent
registers contain 12-bit 2's complement numbers. These numbers are converted to/from offset
format during load and store operations. The mantissa registers have the format shown below.
A 55-bit mantissa (M) includes the implied bit (I) and the sign bit (S). The mantissa is
protected above by three A bits and below by three B bits as well as the standard guard, round,
and sticky bits (R), for a total of 55 + 3 + 3 + 3 = 64 bits.

The A bits allow up to four aligned mantissa additions to be performed before normalizing the
result. The possible one-bit overflows are accumulated in the A bits for later normalization.
The B bits allow up to four multiplies to be performed before normalizing. The bits that shift
off to the right because of the possible one-bit denormalization are accumulated in the B bits
and the guard bit.

The exponent and mantissa data paths are shown in Figure 1. The exponent path has an
adder/subtractor and can receive data from the find-first-one (FF1) unit in the mantissa path.
The mantissa path includes a multiplier, an adder, a shifter, and a find-first-one unit. The
multiplier, adder, shifter, and FF1 unit are pipelined with latencies of 4, 2, 2, and 2 (see below).
The shifter sets the sticky bit of the result if any ones are discarded from the right side of the
operand. The adder uses the round and sticky bits to round each addition. The multiplier both
produces the rounding bits and uses them to round the result.
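
The sticky-bit behavior of the shifter can be illustrated with a short Python sketch. It assumes, purely for illustration, that the mantissa is held as a non-negative integer whose least significant bit is the sticky bit; the actual bit positions are defined by the register format figure, and the function name is hypothetical.

```python
def shift_right_sticky(mantissa, count):
    """Right-shift an integer mantissa, OR-ing any discarded ones into bit 0."""
    if count <= 0:
        return mantissa
    lost = mantissa & ((1 << count) - 1)   # bits that fall off the right side
    result = mantissa >> count
    if lost:
        result |= 1                        # sticky bit records that ones were discarded
    return result


exact = shift_right_sticky(0b100000, 3)     # no ones discarded
inexact = shift_right_sticky(0b100100, 3)   # a one is discarded, sticky is set
print(bin(exact), bin(inexact))
```

Because the sticky bit survives later shifts, the adder can still round correctly after several alignment steps.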

There  are  two  crossovers  between  the  exponent  and  mantissa  data  paths.  The  mantissa  shift 
is  controlled  by  an  exponent  shift  count,  and  the  find-first-one  unit  takes  a  mantissa  as  input 
and  produces  an  exponent  result. 

The clock cycle is determined by the time required for a 12-bit exponent add (≈ 15 ns in a 1μ
CMOS technology). Assuming a carry lookahead adder and a Wallace-tree multiplier [4], times
for mantissa multiply, add, shift, and find-first-one are estimated to be 4, 2, 2, and 2 cycles
respectively. A register read or write takes one clock cycle, and a register can be read in the
same cycle it is written. There is full bypassing under compiler control (no comparators).
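
With these latencies, the overlap available in the multiply-add example of the Scheduling discussion can be sketched with a small greedy scheduler in Python. This is a simplified sketch under stated assumptions: the operation names are illustrative, a one-cycle exponent operation is assumed, each unit accepts one new operation per cycle, and register ports, rounding, and normalization are ignored.

```python
LATENCY = {"E": 1, "MUL": 4, "ADD": 2, "SHIFT": 2}   # cycles per unit (E is assumed)


def greedy_schedule(ops):
    """ops: dict name -> (unit, [dependencies]). Returns (issue cycles, finish time)."""
    issue, ready = {}, {}
    cycle = 0
    while len(issue) < len(ops):
        busy = set()                                  # units already issued to this cycle
        for name, (unit, deps) in ops.items():
            if name in issue or unit in busy:
                continue
            # Issue as soon as every input has been produced and the unit is free.
            if all(d in ready and ready[d] <= cycle for d in deps):
                issue[name] = cycle
                ready[name] = cycle + LATENCY[unit]
                busy.add(unit)
        cycle += 1
    return issue, max(ready.values())


def multiply_add(overlap):
    """Micro-ops for A * B + C; without overlap the add waits for the multiply."""
    esub_deps = ["eadd"] if overlap else ["eadd", "mmul"]
    return {
        "eadd": ("E", []),                    # exponent add for the multiply
        "mmul": ("MUL", []),                  # mantissa multiply
        "esub": ("E", esub_deps),             # exponent subtract for the add
        "shift": ("SHIFT", ["esub"]),         # alignment shift of C
        "madd": ("ADD", ["mmul", "shift"]),   # mantissa add
    }


_, overlapped = greedy_schedule(multiply_add(overlap=True))
_, serialized = greedy_schedule(multiply_add(overlap=False))
print(overlapped, serialized)
```

In this simplified model the exponent subtract and alignment shift hide entirely under the multiply, so only the mantissa add extends the schedule. The cycle counts apply to this toy operation list only and are not the benchmark figures of Section 5.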

The format of a μFP instruction is shown below. Each instruction specifies sources and
destinations for the mantissa and exponent register files, the exponent and mantissa operations,
and a branch specifier. Specifying a register address of all ones (0x1F) selects a bypass from
the result bus. Branches have no delay if not taken and a one cycle delay if taken.


Instruction  Format: 


| EA | EB | EC | MA | MB | MC | EOP | MOP | BOP | BDST |

The units perform the following operations. Each unit also has a NOP operation.


Figure 1: μFPU Data Paths


Exponent OPs

  E+, E-       Exponent add/subtract (EC <- EA op EB).

  FF1          Returns the shift required to normalize mantissa MA (EC <- FF1 MA).
               The result is in the range [-3,57]. Returns the largest positive
               number if no ones are found.

  LDE, STE     Load or store exponent as an integer.

Mantissa OPs

  M+, M-, M*   Mantissa add, subtract, and multiply (MC <- MA op MB).

  SHR, SHL     Mantissa right and left shift (MC <- MA >> EA) or (MC <- MA << EA).
               A negative exponent shifts in the opposite direction.

  ABS, NEG     Zero (ABS) or complement (NEG) the mantissa sign bit.

  LDM, STM     Load or store mantissa as an integer.

  LDF, STF     Load or store mantissa and exponent formatted as a standard
               floating point number.

Branch OPs

  BR           Unconditional branch.

  BNEG         Branch on exponent negative (EC < 0).

  Bcond        Branch on exponent and mantissa compare (EA, MA) relop (EB, MB).
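
The FF1 operation can be sketched in Python as follows. The bit layout is an assumption: `normal_pos` is a hypothetical parameter giving the bit index where the leading one of a normalized mantissa sits, and the "largest positive number" is taken to be the largest 12-bit 2's complement value because the result is written to an exponent register. The sketch does not claim to reproduce the exact [-3,57] range.

```python
NO_ONES = (1 << 11) - 1   # assumed: largest positive 12-bit 2's complement value


def ff1(magnitude, normal_pos):
    """Left shift needed to move the leading one of `magnitude` to bit `normal_pos`."""
    if magnitude == 0:
        return NO_ONES                       # no ones found
    leading = magnitude.bit_length() - 1     # index of the most significant one
    return normal_pos - leading              # negative: overflow into the bits above


print(ff1(1 << 10, 10), ff1(1 << 12, 10), ff1(1, 10), ff1(0, 10))
```

A negative result corresponds to a mantissa that has overflowed into the protection bits above the normal position, so the following shift moves it right and the exponent is adjusted by the same count.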

This  instruction  set  is  the  minimum  required  to  perform  the  evaluation  in  the  next  section.  In 
certain  applications  additional  instructions  would  be  useful.  For  example,  if  divides  were  used 
frequently  a  mantissa  divide  M/  could  be  realized  with  an  SRT  divide  array.  If  divides  are  less 
frequent,  a  reciprocal  approximation  can  be  programmed  using  the  instructions  above. 

This instruction set is intended to complement a simple integer instruction set [7] [1] [6]. For
operations such as reciprocal and square root that are often performed using Newton's method,
there is no need to implement an initial approximation lookup table in the μFPU. These tables
can be kept in main memory and accessed using integer instructions. By exposing the
algorithms for reciprocal, square root, and other floating-point functions, the compiler can perform
optimizations that are not possible if these functions are hidden in microcode.
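
As an illustration of a programmed reciprocal, the standard Newton iteration x <- x(2 - dx) can be seeded from a small table and refined with multiplies and subtracts only. The Python sketch below is a minimal example using ordinary floats; the 8-entry table, its indexing, and the iteration count are assumptions chosen for illustration, not values from this design.

```python
# Hypothetical 8-entry seed table for d in [1, 2), indexed by the top three
# fraction bits of d. In the scheme described above it would live in main memory.
SEED = [1.0 / (1.0 + (i + 0.5) / 8) for i in range(8)]


def reciprocal(d, iterations=4):
    """Approximate 1/d for 1 <= d < 2 by Newton's method."""
    x = SEED[int((d - 1.0) * 8)]     # initial approximation from the table
    for _ in range(iterations):
        x = x * (2.0 - d * x)        # each step roughly squares the relative error
    return x


print(reciprocal(1.7) * 1.7)
```

Because the iteration is ordinary code rather than microcode, its multiplies and subtracts are visible to the optimizer and can be scheduled alongside surrounding operations.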


5  Evaluation 


To evaluate micro-optimizations, the μFPU described in Section 4 is compared against a
conventional floating point unit (cFPU) with the same micro-operation times and register file
bandwidth. The two units were compared on a series of benchmark expressions. For each
expression and each unit, the total number of micro-operations (operation count) and the total
number of clock cycles required to execute the longest path through the expression (time) are
measured.

The  following  assumptions  are  made: 

•  The  two  units  have  identical  clock  rates  and  micro  operation  times. 

• Each cycle, each unit can read two mantissas and two exponents and write one mantissa
and one exponent.

•  All  units  are  pipelined  and  can  accept  a  new  input  each  cycle. 

•  Branches  have  no  delay  if  not  taken  and  a  delay  of  one  if  taken. 

•  Common  subexpression  elimination  is  performed  on  the  macro  floating  point  operations 
for  both  units. 

•  The  operations  on  each  unit  were  scheduled  using  a  greedy  algorithm. 


The  benchmarks  are  summarized  in  the  following  table: 


Description

  Magnitude of Butterfly
  8 Tap FIR Filter

The operation counts and times for the sixteen cases are tabulated below along with total lengths
and times for the two units.

Over the eight benchmarks, micro-optimizations resulted in a 33% reduction in operation count
and a 40% reduction in time. The reductions are largest for large expressions with long
sequences of adds or multiplies.

Expressions with a great deal of internal parallelism give a smaller reduction in execution time.
The parallelism in these expressions can keep a conventional floating point pipeline very busy,
reducing the advantage gained by independently scheduling micro-operations. For example, the
FFT butterfly operation (benchmark 7) calculates the real and imaginary components of its two
outputs in parallel. A pipelined FPU can execute these four calculations in parallel. Because the
μFPU consumes register bandwidth handling intermediate results, it cannot initiate operations
as quickly. Because of the register bandwidth bottleneck, this benchmark has a typical reduction
in operation count (29%), but only a 25% reduction in execution time.

All benchmarks other than number 7 show a greater improvement in execution time than in
operation count. These data suggest that register bandwidth is not an issue for most scalar
expressions. The two units were compared with identical and realistic register file bandwidth.
Data dependencies prevent the conventional FPU from exploiting all of this bandwidth. If
memory bandwidth is equal to register bandwidth, a conventional FPU will outperform a
μFPU on vector operations. The conventional unit can start an operation each cycle while the
μFPU will use some register bandwidth for intermediate results. When register bandwidth is
at least twice memory bandwidth, the μFPU becomes competitive even on vector operations.


Operation Count

  Benchmark    cFPU    μFPU    % Reduction
      1          16      10        38
      2          18      15        17
      3          15       9        40
      4          33      26        21
      5          27      17        37
      6          82      56        32
      7          94      67        29
      8          82      47        43
    TOTAL       367     247        33

Time (cycles)

  Benchmark    cFPU    μFPU    % Reduction
      1          21      13        38
      2          30      19        37
      3          30      16        47
      4          50      31        38
      5          32      17        47
      6          94      52        45
      7          73      55        25
      8          87      47        46
    TOTAL       417     250        40
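
The totals and overall percentage reductions follow directly from the per-benchmark figures. A short Python check is given below; the values are transcribed from the two tables, and `reduction` is an illustrative helper name.

```python
# Per-benchmark figures for benchmarks 1 through 8, transcribed from the tables.
op_count = {
    "cFPU": [16, 18, 15, 33, 27, 82, 94, 82],
    "uFPU": [10, 15, 9, 26, 17, 56, 67, 47],
}
cycles = {
    "cFPU": [21, 30, 30, 50, 32, 94, 73, 87],
    "uFPU": [13, 19, 16, 31, 17, 52, 55, 47],
}


def reduction(table):
    """Return (conventional total, micro-optimized total, percent reduction)."""
    base, opt = sum(table["cFPU"]), sum(table["uFPU"])
    return base, opt, round(100 * (base - opt) / base)


print(reduction(op_count))
print(reduction(cycles))
```

The TOTAL rows are reductions of the summed counts, not averages of the per-benchmark percentages.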

6  Conclusion 


A technique for micro-optimizing floating-point expressions has been described. Micro-optimization
involves reducing floating-point expressions to their constituent micro-operations and
optimizing the resulting sequence. By exposing the micro-operations to the compiler, many
redundant operations can be eliminated. Scheduling of individual micro-operations allows dependent
macro operations to be partially overlapped.

An evaluation of micro-optimization shows that it reduces operation count by 33% and
execution time by 40% compared to conventional floating-point execution. The operation count
reduction is largely due to the elimination of unnecessary normalization operations. Elimination
of common exponent subexpressions contributes a small amount. The improvement in
execution time is due to the elimination of these operations and the increased overlap of operations
resulting from scheduling micro-operations separately. In some cases exponent calculations are
scheduled in such a manner that the execution time is entirely due to mantissa calculations.

A micro floating-point unit is required to execute these floating-point micro-operations.
Although they are integer operations, appropriate word lengths and support for rounding are
required to maintain accuracy. Also, separate mantissa and exponent paths are required to
give performance competitive with conventional floating point units.

A μFPU breaks the pipeline of a conventional floating-point unit into separately schedulable
function units. The additional scheduling flexibility can be exploited through
micro-optimization. The penalty for this separation is potentially higher register file bandwidth,
higher instruction bandwidth, and increased control complexity.

The flexibility inherent in a μFPU has many advantages other than performance. For example,
it can be used to gracefully support high precision floating point numbers. If provision is made
in the μFPU to recover the low bits of a multiply and to link carry bits between adds,
high-precision floating point arithmetic can be implemented at about the same cost as high-precision
integer arithmetic.
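
The two provisions mentioned, linked carries and access to the low half of a product, are the same primitives used for multiword integer arithmetic. The Python sketch below illustrates them; the 64-bit word size and all function names are assumptions for illustration, not features of the design.

```python
WORD = 64                      # assumed word size for this sketch
MASK = (1 << WORD) - 1


def add_with_carry(a, b, carry_in=0):
    """Add two words and a carry; return (sum word, carry out)."""
    total = a + b + carry_in
    return total & MASK, total >> WORD


def mul_wide(a, b):
    """Multiply two words; return (low word, high word) of the double-width product."""
    product = a * b
    return product & MASK, product >> WORD


def multiword_add(xs, ys):
    """Add two equal-length little-endian word lists, linking the carry between adds."""
    out, carry = [], 0
    for a, b in zip(xs, ys):
        word, carry = add_with_carry(a, b, carry)
        out.append(word)
    return out, carry


print(multiword_add([MASK, 0], [1, 0]))
print(mul_wide(MASK, 2))
```

A high-precision mantissa would be held as several such words, with exponent handling unchanged.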

A μFPU can also make tradeoffs between area and performance. For example, a smaller unit
could be constructed that performs mantissa multiply with two or four multiply step operations.
The resulting unit would be significantly smaller and would be slower only in those cases where
two mantissa multiplies can be overlapped.

The  work  described  here  is  an  effort  to  integrate  floating-point  arithmetic  into  RISC  computer 
architecture  [8].  Conventional  RISCs  operate  with  a  scalar  and/or  vector  floating  point  unit  that 
is operated separately from the RISC pipeline. A μFPU integrates floating point operations into
the  pipeline  so  that  only  one  execution  controller  is  required.  Floating-point  micro-operations 
are  handled  in  the  same  manner  as  integer  operations. 

Most  floating  point  calculations  are  limited  by  memory  bandwidth  rather  than  by  arithmetic 
capability.  By  integrating  floating-point  and  address  calculation  in  one  unit,  the  coupling 
between  the  FPU  and  the  memory  system  can  be  made  tighter.  For  example,  micro-operations 
can  be  used  to  fill  the  delay  slots  of  a  delayed  load.  Because  these  operations  are  scheduled 
by the compiler, no time or bandwidth is lost synchronizing data arrival with a separately
scheduled  floating  point  pipeline. 

Much  work  remains  to  be  done  on  micro-optimizations.  Extending  the  expression  compiler  of 
Section  3  into  a  full  compiler  will  create  opportunities  for  additional  optimization.  For  example, 
loops  that  iterate  over  arrays  accumulating  a  running  sum  can  be  optimized  with  a  technique 
similar  to  automatic  block  exponent.  Other  optimizations  become  possible  if  the  compiler  is 
extended  to  infer  the  signs  and  relative  magnitudes  of  some  variables.  If  the  two  inputs  to  a 
mantissa  add  can  be  shown  to  have  the  same  sign,  the  result  will  not  be  denormalized  (it  may 
overflow  one  bit),  and  the  sign  of  the  result  can  be  inferred.  If  exponent  values  can  be  inferred 
or  computed  early,  block  exponents  can  be  applied  across  large  expressions.  If  the  relative 
magnitudes  of  exponents  can  be  inferred,  branches  on  exponent  comparison  can  be  eliminated. 
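
The running-sum case can be sketched in Python. This is a minimal illustration of summing under one shared exponent with a single final normalization, and it assumes the whole array is available so the largest exponent can be found first. The function name and the 53-bit fraction width are choices made for this sketch, and bits shifted out during alignment are simply truncated rather than rounded.

```python
import math


def block_sum(values, frac_bits=53):
    """Sum floats using one shared (block) exponent and one final normalization."""
    if not values:
        return 0.0
    parts = [math.frexp(v) for v in values]     # (mantissa in [0.5, 1), exponent)
    block_exp = max(e for _, e in parts)        # shared exponent for the block
    acc = 0
    for m, e in parts:
        fixed = int(m * (1 << frac_bits))       # mantissa as a signed integer
        acc += fixed >> (block_exp - e)         # align to the block exponent, integer add
    return math.ldexp(acc, block_exp - frac_bits)   # normalize once at the end


print(block_sum([1.0, 2.0, 3.5]))
```

Inside the loop only an alignment shift and an integer add remain; the per-addition normalization of a conventional unit is deferred to the single `ldexp` at the end.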

Floating  point  numbers  are  popular  because  they  free  the  programmer  from  the  tedious  task 
of  scaling  integers.  Scaling  need  not  be  performed  entirely  at  run-time  by  hardware,  however. 
A  suitable  division  of  effort  between  a  micro-optimizing  compiler  and  hardware  with  some 
primitive  support  for  floating  point  can  result  in  substantial  performance  improvement. 